Preface


Introduction 

Data Analysis with Microsoft® Excel: Updated for Office 2007® harnesses
the power of Excel and transforms it into a tool for learning basic statistical
analysis. Students learn statistics in the context of analyzing data. We feel
that it is important for students to work with real data, analyzing real-world
problems, so that they understand the subtleties and complexities of analysis
that make statistics such an integral part of understanding our world.
The data set topics range from business examples to physiological studies
on NASA astronauts. Because students work with real data, they can appreciate
that in statistics no answers are completely final and that intuition and
creativity are as much a part of data analysis as is plugging numbers into
a software package. This text can serve as the core text for an introductory
statistics course or as a supplemental text. It also allows nontraditional
students outside of the classroom setting to teach themselves how to use Excel
to analyze sets of real data so they can make informed business forecasts
and decisions.

Users of this book need not have any experience with Excel, although 
previous experience would be helpful. The first three chapters of the book 
cover basic concepts of mouse and Windows operation, data entry, formulas 
and functions, charts, and editing and saving workbooks. Chapters 4 through 
12 emphasize teaching statistics with Excel as the instrument. 


Using Excel in a Statistics Course 

Spreadsheets have become one of the most popular forms of computer software,
second only to word processors. Spreadsheet software allows the user
to combine data, mathematical formulas, text, and graphics together in a
single report or workbook. For this reason, spreadsheets have become
indispensable tools for business, as they have also become popular in scientific
research. Excel in particular has won a great deal of acclaim for its ease of
use and power.

As spreadsheets have expanded in power and ease of use, there has been
increased interest in using them in the classroom. There are many advantages
to using Excel in an introductory statistics course. An important advantage
is that students, particularly business students, are more likely to
be familiar with spreadsheets and are more comfortable working with data
entered into a spreadsheet. Since spreadsheet software is very common at
colleges and universities, a statistics instructor can teach a course without
requiring students to purchase an additional software package.

Having identified the strengths of Excel for teaching basic statistics, it
would be unfair not to include a few warnings. Spreadsheets are not statistics
packages, and there are limits to what they can do in replacing a full-featured
statistics package. This is why we have included our own downloadable
add-in, StatPlus™. It expands some of Excel’s statistical capabilities. (We
explain the use of StatPlus where appropriate throughout the text.) Using
Excel for anything other than an introductory statistics course would probably
not be appropriate due to its limitations. For example, Excel can easily
perform balanced two-way analysis of variance but not unbalanced two-way
analysis of variance. Spreadsheets are also limited in handling data with
missing values. While we recommend Excel for a basic statistics course, we
feel it is not appropriate for more advanced analysis.


System Information 

You will need the following hardware and software to use Data Analysis
with Microsoft® Excel: Updated for Office 2007®:

• A Windows-based PC. 

• Windows XP or Windows Vista. 

• Excel 2007. If you are using an earlier edition of Excel, you will have to 
use an earlier edition of Data Analysis with Microsoft® Excel. 

• Internet access for downloading the software files accompanying the text. 

The Data Analysis with Microsoft® Excel package includes: 

• The text, which includes 12 chapters, a reference section for Excel’s
statistical functions, Analysis ToolPak commands, StatPlus Add-In
commands, and a bibliography.

• The companion website at www.cengage.com/statistics/berk contains
92 different data sets from real-life situations plus a summary of what
the data set files cover, ten interactive Concept Tutorials, and installation
files for StatPlus, our statistical application. Chapter 1 of the text
includes instructions for installing the files.

• An Instructor’s Manual with solutions to all the exercises in the text is 
available, password-protected on the companion website, to adopting 
instructors.


Excel’s Statistical Tools

Excel comes with 81 statistical functions and 59 mathematical functions.
There are also functions devoted to business and engineering problems. The
statistical functions that basic Excel provides include descriptive statistics
such as means, standard deviations, and rank statistics. There are also
cumulative distribution and probability density functions for a variety of
distributions, both continuous and discrete.

The Analysis ToolPak is an add-in that is included with Excel. If you
have not loaded the Analysis ToolPak, you will have to install it from your
original Excel installation.

The Analysis ToolPak adds the following capabilities to Excel: 

• Analysis of variance, including one-way, two-way without replication,
and two-way balanced with replication

• Correlation and covariance matrices

• Tables of descriptive statistics

• One-parameter exponential smoothing

• Histograms with user-defined bin values

• Moving averages

• Random number generation for a variety of distributions

• Rank and percentile scores

• Multiple linear regression

• Random sampling

• t tests, including paired and two sample, assuming equal and unequal
variances

• z tests

In this book we make extensive use of the Analysis ToolPak for multiple
linear regression problems and analysis of variance.


StatPlus™ 


Since the Analysis ToolPak does not do everything that an introductory
statistics course requires, this textbook comes with an additional add-in called
the StatPlus™ Add-In that fills in some of the gaps left by basic Excel 2007
and the Analysis ToolPak.

Additional commands provided by the StatPlus Add-In give users the
ability to:

• Create random sets of data

• Manipulate data columns

• Create random samples from large data sets

• Generate tables of univariate statistics

• Create statistical charts including boxplots, histograms, and normal
probability plots

• Create quality control charts 

• Perform one-sample and two-sample t tests and z tests 

• Perform non-parametric analyses 

• Perform time series analyses, including exponential and seasonal 
smoothing 

• Manipulate charts by adding data labels and breaking charts down into
categories

• Create and analyze tabular data 

A full description of these commands is included in the Appendix’s 
Reference section and through on-line help available with the application. 


Concept Tutorials 

Included with the StatPlus add-in are ten interactive Excel tutorials that
provide students a visual and hands-on approach to learning statistical concepts.
These tutorials cover: 

• Boxplots 

• Probability 

• Probability distributions 

• Random samples 

• Population statistics 

• The Central Limit Theorem 

• Confidence intervals 

• Hypothesis tests 

• Exponential smoothing 

• Linear regression


Chapter 1


Getting Started with Excel 

Objectives 

In this chapter you will learn to: 

• Install StatPlus files

• Start Excel and recognize elements of the Excel workspace

• Work with Excel workbooks, worksheets, and chart sheets

• Scroll through the worksheet window

• Work with Excel cell references

• Print a worksheet

• Save a workbook

• Install and remove Excel add-ins

• Work with Excel add-ins

• Use the features of StatPlus


In this chapter you’ll learn how to work with Excel 2007 in the
Windows operating system. You’ll be introduced to basic workbook
concepts, including navigating through your worksheets and worksheet
cells. This chapter also introduces StatPlus, an Excel add-in
supplied with this book and designed to expand Excel’s statistical
capabilities.


Getting Started 

This book does not require prior Excel 2007 experience, but familiarity 
with basic features of that program will reduce your start-up time. This 
section provides a quick overview of the features of Excel 2007. If you 
are using an earlier version of Excel, you should refer to the text Data 
Analysis for Excel for Office XP. There are many different versions of 
Windows. This text assumes that you’ll be working with Windows Vista 
or Windows XP. 


Special Files for This Book

This book includes additional files to help you learn statistics. There are
three types of files you’ll work with: StatPlus files, Explore workbooks, and
Data (or Student) files.

Excel has many statistical functions and commands. However, there are 
some things that Excel does not do (or does not do easily) that you will need 
to do in order to perform a statistical analysis. To solve this problem, this
book includes StatPlus, a software package that provides additional statistical
commands accessible from within Excel.

The Explore workbooks are self-contained tutorials on various statistical 
concepts. Each workbook has one or more interactive tools that allow you to 
see these concepts in action. 

The Data or Student files contain sample data from real-life problems. 
In each chapter, you’ll analyze the data in one or more Data files, employing
various statistical techniques along the way. You’ll use other Data files in 
the exercises provided at the end of each chapter. 


Installing the StatPlus Files 

The companion website at www.cengage.com/statistics/berk contains an 
installation program that you can use to install StatPlus on your computer. 
Install your files now. 





To run the installation routine:


1 On the companion website click on the StatPlus link under the Book 
Resources section. 

2 Download the ZIP file containing the StatPlus files to your hard 
drive. 

3 Extract the ZIP file, which will contain a folder called StatPlus. 

4 Place the StatPlus folder in the desired location on your hard drive. 
If you want, you may rename this folder to a different name of your 
choice. 


The installation folder contains files arranged in three separate subfolders 
as shown in Figure 1-1. 


Figure 1-1
The StatPlus
folders



Later in this chapter, you’ll learn how to access the StatPlus program from 
within Excel.


Excel and Spreadsheets

Excel is a software program designed to help you evaluate and present
information in a spreadsheet format. Spreadsheets are most often used by
business for cash-flow analysis, financial reports, and inventory management.
Before the era of computers, a spreadsheet was simply a piece of paper with
a grid of rows and columns to facilitate entering and displaying information
as shown in Figure 1-2.


Figure 1-2
A sample
sales
spreadsheet
(a hand-drawn Blue Sky Airlines sales report listing January and February
sales for the North, South, East, and West regions; a callout notes that you
add the regional numbers to get the total at the bottom of each column)


Computer spreadsheet programs use the old hand-drawn spreadsheets
as their visual model but add a few new elements, as you can see from the
Excel worksheet shown in Figure 1-3.


Figure 1-3 
A sample 
spreadsheet 
as formatted 
within Excel 



However, Excel is so flexible that its application can extend beyond traditional
spreadsheets into the area of data analysis. You can use Excel to enter
data, analyze the data with basic statistical tests and charts, and then create
reports summarizing your findings.


Launching Excel

When Excel 2007 is installed on your computer, the installation program
automatically inserts a shortcut icon to Excel 2007 in the Programs menu
located under the Windows Start button. You can click this icon to launch
Excel.

To start Excel:

1 Click the Start button on the Windows Taskbar and then click All
Programs.

2 Click Microsoft Office and then click Microsoft Office Excel 2007 as
shown in Figure 1-4.

Note: Depending on how Windows has been configured on your
computer, your Start menu may look different from the one shown
in Figure 1-4. Talk to your instructor if you have problems launching
Excel 2007.


Figure 1-4 
Starting 
Excel 2007

Excel starts up, displaying the window shown in Figure 1-5.


Viewing the Excel Window

The Excel window shown in Figure 1-5 is the environment in which you’ll
analyze the data sets used in this textbook. Your window might look different
depending on how Excel has been set up on your system. Before proceeding,
take time to review the various elements of the Excel window. A
quick description of these elements is provided in Table 1-1.


Table 1-1 Excel Elements

Excel Element          Purpose
Active cell            The cell currently selected in the worksheet
Cells                  Stores individual text or numeric entries
Column headings        Organizes cells into lettered columns
Excel ribbon           A toolbar containing Excel commands broken down into
                       different topical tabs
Formula bar            Displays the formula or value entered into the currently
                       selected cell
Horizontal scroll bar  Used to scroll through the contents of the worksheet in a
                       horizontal direction
Name box               Displays the name or reference of the currently selected
                       object or cell
Office button          Displays a menu of commands related to the operation and
                       configuration of Excel and Excel documents
Ribbon tab             A tab containing Excel command buttons for a particular
                       topical area
Row headings           Organizes cells into numeric rows
Sheet tabs             Click to display individual worksheets
Status bar             Displays messages about current Excel operations
Tab group              A group of command buttons within a ribbon tab containing
                       commands focused on the same set of tasks
Title bar              Displays the name of the application and the current Excel
                       document
Vertical scroll bar    Used to scroll through the contents of the worksheet in a
                       vertical direction
Worksheet              A collection of cells laid out in a grid where each cell can
                       contain a single text or numeric entry
Zoom controls          Controls used to increase or decrease the magnification
                       applied to the worksheet


Running Excel Commands 

You can run an Excel command either by clicking the icons found on the
Excel ribbon or by clicking the Office button and then clicking one of the
commands from the menu that appears. Figure 1-6 shows how you would
open a file using the Open command available on the menu within the
Office button. Note that some of the commands have keyboard shortcuts,
which are key combinations that run a command or macro. For example,
pressing the CTRL and O keys simultaneously will also run the Open command.

The menu commands below the Office button are used to set the properties
of your Excel application and entire Excel documents. If you want to
work with the contents of a document, you work with the commands found
on the Excel ribbon.

Each of the tabs on the Excel ribbon contains a rich collection of icons and 
buttons providing one-click access to Excel commands. Table 1-2 describes 
the different tabs available on the ribbon. 

Note that this list of tabs and groups will change depending on how you
are using Excel. Excel, like other Office 2007 products, is designed to
show only the commands that are pertinent to your current task.


Table 1-2 Excel Ribbon Tabs

Home
  Description: Used to format the contents of worksheet cells
  Ribbon groups: Clipboard, Font, Alignment, Number, Styles, Cells, Editing

Insert
  Description: Used to insert objects into an Excel workbook
  Ribbon groups: Tables, Illustrations, Charts, Links, Text

Page Layout
  Description: Used to format the printed version of the Excel workbook and
  to control how each worksheet appears in the Excel window
  Ribbon groups: Themes, Page Setup, Scale to Fit, Sheet Options, Arrange

Formulas
  Description: Used to insert formulas into a worksheet and to audit the
  effects of your formulas on cell values
  Ribbon groups: Function Library, Defined Names, Formula Auditing,
  Calculation

Data
  Description: Used to import data from different data sources and to group
  data values and perform what-if analysis on data
  Ribbon groups: Get External Data, Connections, Sort & Filter, Data Tools,
  Outline

Review
  Description: Used to proof the contents of a workbook and to manage the
  document in a workgroup environment involving several users
  Ribbon groups: Proofing, Comments, Changes

View
  Description: Controls the display of the Excel worksheet window including
  the ability to hide or display Excel elements
  Ribbon groups: Workbook Views, Show/Hide, Zoom, Window, Macros

Developer
  Description: Contains tools used to add macros and other features to extend
  the capabilities of Excel
  Ribbon groups: Code, Controls, XML

Add-Ins
  Description: Contains user-defined menus and tab groups created from
  add-ins (note that this tab will only appear when an add-in has been
  installed and activated)
  Ribbon groups: Various groups, depending upon the add-ins being used


Each tab is broken up into different topical groups. For example, the Home
tab is broken into the following groups: Clipboard, Font, Alignment, Number,
Styles, Cells, and Editing. When you are asked to run a command, you will
be told which button to click from which tab group. For example, to copy the
contents of a worksheet cell you would be given the following command:


1 Click the Copy button from the Clipboard group on the Home tab
to copy the contents of the active cell.


If you are asked to run a command using a keyboard shortcut, the keyboard
combination will be shown in boldface with the keys joined by a plus sign to
indicate that you should press these keys simultaneously. For example:


1 Press CTRL+n to create a new blank document.


In addition to the Excel ribbon, you may occasionally see context-sensitive
ribbons. These ribbons only appear when certain items are selected
in the Excel document. For example, when you select an Excel chart, Excel
will display a Chart ribbon containing a collection of tabs and tab groups
designed for use with charts.


Excel Workbooks and Worksheets


Excel documents are called workbooks. Each workbook is made up of individual
spreadsheets called worksheets and sheets containing charts called chart sheets.


Opening a Workbook


To learn some basic workbook commands, you’ll first look at an Excel workbook
containing public-use data from Kenai Fjords National Park in Alaska.
The data are stored in the Park workbook, located in the Chapter01 subfolder
of the Data folder. Open this workbook now.

To open the Park workbook:

1 Click the Office button and then click Open from the Office menu.

The Open dialog box appears as shown in Figure 1-7. Your dialog
box will display a different folder and file list.

2 Locate the folder containing your Chapter01 data files.

3 Double-click the Park workbook.

Excel opens the workbook as shown in Figure 1-8.


Figure 1-8
The Park
workbook
(callouts: active sheet, sheet tabs)


A single workbook can have as many as 255 worksheets. The names of 
the sheets appear on tabs at the bottom of the workbook window. In the Park 
workbook, the first sheet is named Total Usage and contains information on 
the number of visitors at each location in the park over the previous year. 
The sheet shows both a table of visitor counts and a chart with the same
information. Note that the chart has been placed within the worksheet. Placing
an object like a chart on a worksheet is known as embedding. Glancing over 
the table and chart, we see that the peak-usage months were May through 
September. 

The second tab is named Usage Chart and contains another chart of park
usage. Following the first two sheets are worksheets devoted to usage data
from each month of the year. Your next task will be to move between the
various sheets in the Park workbook.


Scrolling through a Workbook

To move from one sheet to another, you can either click the various sheet 
tabs in the workbook or use the navigational buttons located at the bottom 
of the workbook window. Table 1-3 provides a description of these buttons.


Table 1-3 Workbook Navigation Buttons

Button            Purpose
First sheet       Scroll to the first sheet in the workbook
Previous sheet    Scroll to the previous sheet
Next sheet        Scroll to the next sheet
Last sheet        Scroll to the last sheet in the workbook


You can also move to a specific sheet by right-clicking one of these navigation
buttons and selecting the sheet from the resulting pop-up list of sheet
names. Try viewing some of the other sheets in the workbook now.

To view other sheets:

1 Click the Usage Chart sheet tab.

2 Excel displays the chart. Click anywhere within the chart to select
it. See Figure 1-9.


Figure 1-9
The Usage
Chart sheet

Note that when you selected the chart, Excel displayed a new
ribbon, the Chart Tools ribbon, containing specific commands for
working with charts. You’ll learn more about Excel charts and working
with this ribbon in Chapter 3.

3 Click the Jan sheet tab.

4 The worksheet for the month of January is displayed as shown in
Figure 1-10.


Figure 1-10
The Jan
worksheet
(callout: active sheet)


The form that appears in this worksheet resembles the form used by
the Kenai Fjords staff to record usage information. It contains information
on the park, the number of visits each month, visitor hours,
and other important data. Some of these data are hidden beyond the
boundary of the worksheet window.

5 Drag the Vertical scrollbar down to move the worksheet down and
view the rest of the January data.


Clearly, the Park workbook is complex. Its sheets contain many pieces of
information, much of it interrelated. This book will not cover all the techniques
used to create a workbook like this one, but you should be aware of
the formatting possibilities that exist.


Worksheet Cells

Each worksheet can be thought of as a grid of cells, where each cell can
contain a numeric or text entry. Cells are referenced by their location on
the grid. For example, the total number of visitors at the park is shown in
cell F17 of the Total Usage worksheet (see Figure 1-11). As you’ll see later
in Chapter 2, if you were to use this value in a function or Excel command,
you would use the cell reference F17.


Figure 1-11
Excel cell
references
(callouts: the cell address appears in the Name box; active cell)


Selecting a Cell 

When you want to enter data or format a particular value, you must first
select the cell containing the data or value. To do this, you simply click
on the cell in the worksheet. Try this now with cell F17 in the Total Usage
worksheet.

To select a cell from the worksheet:

1 Click the Total Usage sheet tab to move back to the front of the
workbook.

2 Click F17 in the worksheet grid.


Cell F17 now has a small box around it, indicating that it is the active
cell (see Figure 1-11). Moreover, the Name box now displays F17, indicating
that this is the active cell, and the formula bar displays the formula
=SUM(F5:F16). This formula calculates the sum of the values in cells F5
through F16. You’ll learn more about formulas in Chapter 2.
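
To make the idea of a cell reference concrete, the following Python sketch (outside Excel, and only an illustration) splits a reference such as F17 into a column number and a row number and then mimics what a formula like =SUM(F5:F16) computes. The helper names `column_index`, `parse_cell`, and `sum_range` are hypothetical, and the monthly values are made up; they are not the Park workbook's data.

```python
import re


def column_index(letters: str) -> int:
    """Convert column letters (A, B, ..., Z, AA, ...) to a 1-based index."""
    index = 0
    for ch in letters.upper():
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index


def parse_cell(ref: str) -> tuple[int, int]:
    """Split a reference such as 'F17' into (column, row), both 1-based."""
    match = re.fullmatch(r"([A-Za-z]+)([0-9]+)", ref)
    if match is None:
        raise ValueError(f"not a cell reference: {ref!r}")
    return column_index(match.group(1)), int(match.group(2))


def sum_range(cells: dict, ref: str) -> float:
    """Mimic =SUM(start:end) over a dict keyed by (column, row)."""
    start, end = ref.split(":")
    (c1, r1), (c2, r2) = parse_cell(start), parse_cell(end)
    return sum(
        cells.get((c, r), 0)  # empty cells contribute nothing to the sum
        for c in range(min(c1, c2), max(c1, c2) + 1)
        for r in range(min(r1, r2), max(r1, r2) + 1)
    )


# Made-up monthly totals placed in F5..F16 (assumed values for illustration).
sheet = {(6, row): 100 * (row - 4) for row in range(5, 17)}
total = sum_range(sheet, "F5:F16")
print(parse_cell("F17"), total)
```

The sketch treats a worksheet as a dictionary keyed by (column, row), which is enough to show that a range reference names a rectangular block of cells between two corners.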

If you want to select a group of cells, known as a cell range or range, you 
must select one corner of the range and then drag the mouse pointer over 
cells. To see how this works in practice, try selecting the usage table located 
in the cell range B4:F17 of the Total Usage worksheet. 

To select a cell range: 

1 Click B4 and hold down the mouse button.

2 With the mouse button still pressed, drag the mouse pointer over to 
cell F17. 

3 Release the mouse button. 

Now the range of cells from B4 down to F17 is selected. Observe that a 
selected cell range is highlighted to differentiate it from unselected cells. 
A cell range selected in this fashion is always rectangular in shape and 
contiguous. If you want to select a range that is not rectangular or contiguous,
you must hold down the CTRL key on your keyboard while you select the
separate distinct groups that make up the range. For example, if you want 
to select only the cells in the range B4:B17 and F4:F17, you must use this 
technique. 

To select a noncontiguous range: 

1 Select the range B4:B17. 

2 Press and hold the CTRL key on your keyboard.

3 With the CTRL key still pressed, select the range F4:F17.
The selected range is shown in Figure 1-12.


Figure 1-12
Noncontiguous
cell range
(callout: ranges B4:B17 and F4:F17 are selected)


The cell reference for this group of cells is B4:B17;F4:F17, where the
semicolon indicates a joining of two distinct ranges.
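
As a rough illustration of what such a combined reference covers, the Python sketch below expands B4:B17;F4:F17 into the individual cells it names. The function names `expand_range` and `expand_union` are hypothetical, and the sketch assumes single-letter columns and the semicolon separator used in the text above.

```python
import re


def expand_range(ref: str) -> list[str]:
    """List every cell in a rectangular range such as 'B4:B17'.

    Assumes single-letter columns (A to Z) to keep the sketch short.
    """
    match = re.fullmatch(r"([A-Z])(\d+):([A-Z])(\d+)", ref)
    if match is None:
        raise ValueError(f"not a simple range: {ref!r}")
    col1, row1, col2, row2 = match.groups()
    first_col, last_col = sorted((ord(col1), ord(col2)))
    first_row, last_row = sorted((int(row1), int(row2)))
    return [
        f"{chr(col)}{row}"
        for col in range(first_col, last_col + 1)
        for row in range(first_row, last_row + 1)
    ]


def expand_union(ref: str) -> list[str]:
    """Expand a joined reference whose parts are separated by semicolons."""
    cells = []
    for part in ref.split(";"):
        cells.extend(expand_range(part.strip()))
    return cells


selected = expand_union("B4:B17;F4:F17")
print(len(selected), selected[0], selected[-1])
```

Each part of the joined reference is still a rectangle; the noncontiguous selection is simply the collection of those rectangles.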


Moving Cells 

Excel allows you to move the contents of your cells around without affecting
their values. This is a great help in formatting your worksheets. To move
a cell or range of cells, simply select the cells and then drag the selection
to a new location. Try this now with the table of usage data from the Total
Usage worksheet.

To move a range of cells:

1 Select the range B4:F17.

2 Move the mouse pointer to the border of the selected area so that the
pointer changes from a white cross to a four-headed arrow.

3 Drag the selected area down two cells, so that the new range is now
B6:F19, and release the mouse button.

Note that as you moved the selected range, Excel displayed a screen
tip with the new location of the range.

4 Click F19 to deselect the cell range.


When you look at the formula bar for cell F19, note that the formula is
now changed from =SUM(F5:F16) to =SUM(F7:F18). Excel will automatically
update the cell references in your formulas to account for the fact that
you moved the cell range.
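
The adjustment Excel makes here can be imitated with a few lines of Python. This is only a sketch of the idea, not Excel's actual algorithm: the hypothetical `shift_rows` function moves every row number in a formula by a fixed offset, whereas Excel adjusts only references that point into the moved range, and the sketch ignores absolute references written with dollar signs.

```python
import re

# A cell reference: column letters followed by a row number. The lookahead
# skips names that are followed by "(" so function names such as LOG10 are
# left alone.
CELL = re.compile(r"\b([A-Z]+)(\d+)\b(?!\()")


def shift_rows(formula: str, offset: int) -> str:
    """Return the formula with every referenced row moved by `offset`."""

    def bump(match: re.Match) -> str:
        column, row = match.group(1), int(match.group(2))
        return f"{column}{row + offset}"

    return CELL.sub(bump, formula)


# Moving the table down two rows turns the total formula for cell F17,
# =SUM(F5:F16), into the formula seen in cell F19.
print(shift_rows("=SUM(F5:F16)", 2))
```

Because the references are rewritten rather than the values, the moved total still adds up the same twelve cells in their new location.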

You can also use the Cut, Copy, and Paste buttons to move a cell range.
These buttons are essential if you want to move a cell range to a new workbook
or worksheet (you can’t use the drag and drop technique to perform
that action). Try using the Cut and Paste method to move the table back to
its original location.

To cut and paste a range of cells: 

1 Select the range B6:F19. 

2 Click the Cut button from the Clipboard group on the Home tab or
press CTRL+x.

A flashing border appears around the cell range, indicating that it
has been cut or copied from the worksheet.

3 Click B4.

4 Click the Paste button from the Clipboard group on the Home tab
or press CTRL+v.

The table now appears back in the cell range B4:F17.

5 Click cell A1 to make A1 the active cell again.


If you want to copy a cell range rather than move it, you can use the Copy 
button in the above steps, or if you prefer the drag and drop technique, 
hold down the CTRL key while dragging the cell range to its new location; 
this will create a copy of the original cell range at the new location. You can 
refer to Excel’s online Help for more information.


Printing from Excel

It would be useful for the chief of interpretation at Kenai Fjords National
Park to have a hard copy of some of the worksheets and charts in the Park
workbook. To do this, you can print out selected portions of the workbook.


Previewing the Print Job

Before sending a job to the printer, it’s usually a good idea to preview the
output. With Excel’s Print Preview window, you can view your job before
it’s printed, as well as set up the page margins, orientation, and headers and
footers. Try this now with the Total Usage worksheet.

To preview a print job:

1 Verify that Total Usage is still the active worksheet.

2 Click the Office button, then click Print, and then click Print Preview.
The Print Preview opens as displayed in Figure 1-13.

Figure 1-13
The Print
Preview
window
(callouts: Print Preview tab; Zoom controls to increase or decrease the
magnification of the previewed document)


Table 1-4 describes the variety of options available to you from the Print
Preview tab in the Print Preview window.


Table 1-4 Print Preview Options

Button                 Description
Print                  Send the document to the printer
Page Setup             Set up the properties of the printed page
Zoom                   Zoom in and out of the Preview window
Next Page              View the next page in the print job
Previous Page          View the previous page in the print job
Show Margins           Display margins in the Preview window
Close Print Preview    Close the Print Preview window


Setting Up the Page 

The Preview window opens with the default print settings for the workbook.
You can change these settings for each print job. You may add a
header or footer to each page, change the orientation from portrait to
landscape, and modify many other features. To see how this works, adjust the
settings for the current print job by adding a header and changing the page
layout.

To add a header to a print job: 

1 Click the Page Setup button from the Print Preview tab. 

2 Click the Header/Footer dialog sheet tab. 

3 Excel provides a list of built-in headers that you can select from 
the Header drop-down list. You can also write your own; you’ll do 
this now. 

4 Click the Custom Header button. 

5 Type Yearly Usage Report in the Center section of the Header dialog 
box as shown in Figure 1-14.


Figure 1-14
Adding a header
to the printed
page
(callout: Page Setup button)


6 Click the OK button. 


Because the print job is more horizontal than vertical, it would be a good 
idea to change the orientation from portrait to landscape. 

To change the page orientation: 

1 Click the Page dialog sheet tab within the Page Setup dialog box. 

2 Click the Landscape option button. 

3 Click the OK button. 

Figure 1-15 shows the new layout of the print job with a header and 
landscape orientation.


Figure 1-15
The print job with a header and landscape orientation

4 Click the Close Print Preview button from the Preview group on the 
Print Preview tab to close the Preview window. 


There are many other printing features available to you in Excel. Check 
the online Help for more information. 


Printing the Page 

To print your worksheet, you can select the Print command from the Office 
menu. Try printing the Total Usage worksheet now. 

To print the Total Usage worksheet: 

I Click the Office button and then click Print from the Office 
menu. 

The Print Dialog box appears. See Figure 1-16. 

Figure 1-16 Print dialog box (callouts: default printer; specify what parts of the workbook to print; open the Print Preview window)


Notice that you can print a selection of the active worksheet (in other
words, you can select a cell range and print only that part of the
worksheet), the entire active sheet or sheets, or the entire workbook. You can
also select the number of copies to print and the range of pages. The
other options let you select your printer from a list (if you have access
to more than one) and set the properties for that particular printer. You
can also click the Preview button to go to the Print Preview window.

2 Click OK to start the print job.

Your printer should soon start printing the Total Usage worksheet.


If you were to hand this printout to the chief of interpretation of the park,
he or she would be able to use the information contained in it to determine
when to hire extra help at the various stations in the park.


Saving Your Work 

You should periodically save your work when you make changes to a
workbook or when you are entering a lot of data so that you won’t lose much
work if your computer or Excel crashes. Excel offers two options for saving
your work: the Save command, which saves the file; and the Save As
command, which allows you to save the file under a new name.

So that you do not change the original files (and can go through the
chapters again with unchanged files if necessary), you’ll be instructed
throughout this book to save your work under new file names. To save the changes
you made to the Park workbook, save the file as Park Usage Report. If using
your own computer, you can save the workbook to your hard drive. If you
are using a computer on the school network, you may be asked to save your
work to your own floppy disk. This book assumes that you’ll save your work
to the same folder containing the original data workbook.

To save the Park workbook as Park Usage Report: 

1 Click the Office button and then click Save As from the Office
menu to open the Save As dialog box.

2 Navigate to and select the folder in which you want to save the file,
or save the file in the same folder as the Park workbook.

3 Type Park Usage Report in the File name box. See Figure 1-17.


Figure 1-17 Save As dialog box (callouts: new workbook name; click to save the workbook under a different document format)



4 Click OK. 

Excel then saves the workbook under the name Park Usage Report. Note
that you can save the workbook under a variety of formats by clicking the
Save as type list box and choosing a file type.


Excel Add-Ins 

Excel’s capabilities can be expanded through the use of special programs 
called add-ins. These add-ins tie into Excel’s special features, almost look¬ 
ing like a part of Excel itself. Various add-ins that allow you to easily gen¬ 
erate reports, explore multiple scenarios, or access databases are supplied 
with Excel. To use these add-ins, you have to go through a process of saving 
the add-in files to a location on your computer, and then telling Excel where 
to find the add-in file. 

Excel comes with an add-in called Analysis ToolPak that provides some
of the statistical commands you’ll need for this book. You have already
copied another add-in, StatPlus, to your hard disk. Now you will install
that add-in in Excel.


Loading the StatPlus Add-In 


The add-ins on your computer are stored in a list in Excel. From this list, 
you can activate the add-in or browse for new ones. First you’ll browse for 
the StatPlus add-in. 


To browse and install the StatPlus add-in:

1 Click the Office button and then click Excel Options located at
the bottom of the pop-up menu.

2 Click Add-Ins from the list of Excel options as shown in Figure 1-18.

Figure 1-18 Excel Options dialog box



3 Click the Manage list box at the bottom of the window, select Excel 
Add-Ins and then click the Go button. 

The Add-Ins dialog box opens as shown in Figure 1-19. 

Figure 1-19 List of currently available add-ins (callout: click to browse for an add-in file)


Each available add-in is shown in Figure 1-19 along with a checkbox
indicating whether that add-in is currently loaded in Excel.

4 Click the Browse button.

5 Locate the installation folder on your hard drive where you placed
the StatPlus files, and open the folder.

6 Open the Addins subfolder.

7 Click StatPlus.xla and click OK.

StatPlus Version 3.0 now appears in the Add-Ins dialog box. If it is
not checked, click the checkbox. See Figure 1-20.

Figure 1-20 The StatPlus add-in (callout: StatPlus add-in installed and activated)


8 Click the OK button. 

After you click the OK button, the Add-Ins dialog box closes and
a new tab named Add-Ins should be added to the Excel ribbon.

9 Click the Add-Ins tab on the Excel ribbon and then click StatPlus
from the Menu Commands group on the tab.

The menu commands offered by StatPlus are shown in Figure 1-21.
You’ll have a chance to work with these commands later in the book.

Figure 1-21 The StatPlus menu (callouts: Add-Ins tab; StatPlus menu commands)



Loading the Data Analysis ToolPak

Now that you’ve seen how to load the StatPlus add-in, you can load the Data 
Analysis ToolPak. Since the Data Analysis ToolPak comes with Excel, you 
may need to have your Excel or Office installation disks handy. 

To load the Data Analysis ToolPak: 

1 Click the Office button and then click Excel Options. Click Add-Ins
from the Excel Options dialog box and then click Go next to the
Manage Excel Add-Ins list box.

2 Click the checkbox for the Analysis ToolPak and click OK.

3 At this point, Excel may prompt you for the installation CD; if so,
insert the CD into your CD-ROM drive and follow the installation
instructions.

When activated, the Data Analysis ToolPak appears in a new group named
Analysis on the Data tab. View the Data Analysis ToolPak now.

To access commands for the Data Analysis ToolPak: 

1 Click the Data tab and then click Data Analysis from the Analysis
group.

The Data Analysis dialog box appears as shown in Figure 1-22.


Figure 1-22 
Viewing the Data 
Analysis ToolPak 


The Analysis group is 
added to the Data tab 



Data Analysis ToolPak commands 


The list of commands available with the Data Analysis ToolPak is
shown in the Analysis Tools list box. To run one of these commands,
select it from the list box and then click the OK button. You’ll have
an opportunity to use some of the Data Analysis ToolPak’s
commands later in this book.

2 Click Cancel to close the Data Analysis dialog box.


Unloading an Add-In


If at any time you want to unload the Data Analysis ToolPak or StatPlus,
you can do so by returning to the Add-Ins dialog box shown
in Figure 1-19 and deselecting the checkbox for the specific add-in.
Unloading an add-in is like closing a workbook; it does not affect the add-in
file. If you want to use the add-in again, simply reopen the Add-Ins dialog
box and reselect the checkbox. If you exit Excel with an add-in loaded, Excel
will assume that you want to run the add-in the next time you run Excel, so
it will load it for you automatically.


Features of StatPlus 

StatPlus has several special features that you should be aware of. These 
include modules and hidden data. 


Using StatPlus Modules 

StatPlus is made up of a series of add-in files, called modules. Each module 
handles a specific statistical task, such as creating a quality control chart 
or selecting a random sample of data. StatPlus will load the modules you 
need on demand (this way, you do not have to use up more system memory 
than needed). After using StatPlus for a while, you may have a great many 
modules loaded. If you want to reduce this number, you can view the list of 
currently opened modules and unload those you’re no longer using. 

To view a list of StatPlus modules: 

1 Click Unload Modules from the StatPlus menu on the Add-Ins tab.

StatPlus displays a list of loaded modules. A sample list is shown in
Figure 1-23. Yours will be different.

Figure 1-23 Viewing StatPlus modules (callouts: click the checkbox to unload the module; total size of all loaded modules)


If you want to unload all of the modules, click the Remove All 
checkbox. If you want to remove individual modules, click the 
checkbox in front of the module name. Once you unload a module, 
it’s removed from Excel, but it will be automatically reloaded the 
next time you try to use a command supported by the module. 

2 Click OK to close the Remove StatPlus Modules dialog box. 


Hidden Data 

Several StatPlus commands employ hidden worksheets. A hidden work¬ 
sheet is a worksheet in your workbook that is hidden from view. Hidden 
worksheets are used in creating histograms, boxplots, and normal probabil¬ 
ity plots (don’t worry, you’ll learn about these topics in later chapters). You 
can view these hidden worksheets if you need to troubleshoot a problem 
with one of these charts. There are three hidden worksheet commands in 
StatPlus (see Table 1-5; they are available in the General Utilities submenu). 

Table 1-5 Hidden Worksheet Commands

Command                        Description
View hidden data               Unhides a hidden StatPlus worksheet
Rehide hidden data             Rehides a StatPlus worksheet
Remove unlinked hidden data    Removes extraneous data, like hidden data for a
                               deleted chart, from the hidden worksheet


Linked Formulas


Many of the StatPlus commands use custom formulas to calculate statistical
and mathematical values. One advantage of these formulas is that if the
source data in your statistical analysis is changed, the formulas will reflect
the changed data. A disadvantage is that if your workbook is moved to
another computer in which StatPlus is not installed (or installed in a different
folder), those formulas will no longer work.

If you decide to move your workbook to a new location, you can freeze
the data in your workbook, removing the custom formulas but keeping the
values. Once a formula has been frozen, the value will not be updated if the
source data changes. A frozen workbook can be opened on other computers
running Excel without error.
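
The difference between a linked formula and a frozen value can be pictured with a short Python sketch. This is only an analogy, not how StatPlus works internally; the names source, dynamic_total, and frozen_total are made up for the illustration.

```python
# Hypothetical source data (three gasoline sales values).
source = [8415, 8499, 8831]

def dynamic_total():
    """Behaves like a linked formula: recomputed from the source each time."""
    return sum(source)

# "Freezing" keeps the current value and discards the formula.
frozen_total = dynamic_total()

# Change the source data after freezing.
source[0] = 9000

print("linked formula:", dynamic_total())  # reflects the new data
print("frozen value:  ", frozen_total)     # still the old result
```

The frozen value no longer depends on anything else, which is why a frozen workbook can travel to a computer without StatPlus.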

If the other computer also has the StatPlus add-in installed but in a
different folder, you can still use the custom formulas by pointing the workbook
to the new location of the StatPlus add-in file. Table 1-6 describes the
various linked formula commands (available in the StatPlus General Utilities
submenu).


Table 1-6 Linked Formula Commands

Command                     Description
Resolve StatPlus links      Find the location of the StatPlus add-in on the current
                            computer and attach custom formulas to the new location
Freeze data in worksheet    Freeze all data on the current worksheet
Freeze hidden data          Freeze all data on hidden worksheets
Freeze data in workbook     Freeze all data in the current workbook


Setup Options 

If you want to control how StatPlus operates in Excel, you can open the
StatPlus Options dialog box from the StatPlus menu. The dialog box, shown
in Figure 1-24, is divided into four dialog sheets: Input, Output, Charts,
and Hidden Data.







Figure 1-24 
StatPlus 
Options 
dialog box 



The Input sheet allows you to specify the default method used for
referencing the data in your workbook. The two options are (1) using range
names and (2) using range references. You’ll learn about range names and
range references in Chapter 2.

The Output sheet allows you to specify the default format for your output.
You can choose between creating dynamic and static output. Dynamic
output uses custom formulas, which you’ll have to adjust if you want to move
your workbook to a new computer. Static output only displays the output
values and does not update if the input data are changed. You can also
choose the default location for your output, from among (1) a cell on the
current worksheet, (2) a new worksheet in the current workbook, or (3) a new
workbook.

The Charts sheet allows you to specify the default format for chart output.
You can choose between creating charts as embedded objects in worksheets
or as separate chart sheets. This will be discussed in Chapter 3.

The Hidden Data sheet allows you to specify whether to hide worksheets 
used for the background calculations involved in creating charts and statistical 
calculations. 

All of the options specified in the StatPlus Options dialog box are default 
options. You can override any of these options in a specific dialog box as 
you perform your analysis. 

You can learn more about StatPlus and its features by viewing the online 
Help file. Help buttons are included in every dialog box. You can also open 
the Help file by clicking About StatPlus from the StatPlus menu. 


Exiting Excel

When you are finished with an Excel session, you should exit the program 
so that all the program-related files are properly closed. 

To exit Excel: 

1 Click the Office button and then click the Exit Excel button from
the bottom of the menu.


If you have unsaved work, Excel asks whether you want to save it before 
exiting. If you click No, Excel closes and you lose your work. If you click 
Yes, Excel opens the Save As dialog box and allows you to save your work. 
Once you have closed Excel, you are returned to the Windows desktop or to 
another active application. 


Chapter 2


Working with Data 

Objectives 


In this chapter you will learn to: 

• Enter data into Excel from the keyboard

• Work with Excel formulas and functions

• Work with cell references and range names

• Query and sort data using the AutoFilter and Advanced Filter

• Import data from text files and databases

In this chapter you’ll learn how to enter data in Excel through the
keyboard and by importing data from text files and databases. You’ll
learn how to create Excel formulas and functions to perform simple
calculations.

You’ll be introduced to cell references and learn how to refer to cell 
ranges using range names. Finally, you’ll learn how to examine your data 
through the use of queries and sorting. 


Data Entry 

One of the many uses of Excel is to facilitate data entry. Error-free data 
entry is essential to accurate data analysis. Excel provides several methods 
for entering your data. Data sets can be entered manually from the key¬ 
board or retrieved from a text file, database, or online Web sources. You 
can also have Excel automatically enter patterns of data for you, saving you 
the trouble of creating these data values yourself. You’ll study all of these 
techniques in this chapter, but first you’ll work on entering data from the 
keyboard. 

Entering Data from the Keyboard 

Table 2-1 displays average daily gasoline sales and other (nongasoline) 
sales for each of nine service station/convenience franchises in a store 
chain in a western city. There are three columns in this data set: Station, 
Gas, and Other. The Station column contains an id number for each of the 
nine stations. 

The Gas column displays the gasoline sales for each station. The Other 
column displays sales for nongasoline items. 


Table 2-1 Service Station Sales

Station    Gas       Other
1          $8,415    $7,211
2          $8,499    $7,500
3          $8,831    $7,899
4          $8,587    $7,488
5          $8,719    $7,111
6          $8,001    $6,281
7          $9,567    $13,712
8          $9,218    $12,056
9          $8,215    $7,508

To practice entering data, you’ll insert this information into a blank
worksheet. To enter data, you first select the cell corresponding to the upper left
corner of the table, making it the active cell. You then type the value or text 
you want placed in the cell. To move between cells, you can either press the 
Tab key to move to the next column in the same row or press the Enter key 
to move to the next row in the same column. If you are entering data into 
several columns, the Enter key will move you to the next row in the first 
column of the data set. 


Figure 2-1 
The first 
rows of 
the service 
station 
data set 


To enter the first row of the service station data set: 

1 Launch Excel as described in Chapter 1. 

Excel shows an empty workbook with the name Book1 in the title bar.

2 Click cell A1 to make it the active cell.

3 Type Station and then press Tab.

4 Type Gas in cell B1 and press Tab.

5 Type Other in cell C1 and press Enter.

Excel moves you to cell A2, making it the active cell. 

6 Using the same technique, type the next two rows of the table, so 
that data for the first two stations are displayed. Your worksheet 
should appear as in Figure 2-1. 



Entering Data with Autofill 

If you’re inserting a column or row of values that follow some sequential
pattern, you can save yourself time by using Excel’s Autofill feature. The
Autofill feature allows you to fill up a range of values with a series of numbers
or dates. You can use Autofill to automatically generate columns containing
data values such as:

1, 2, 3, 4, . . . 9, 10; 

1, 2, 4, 8, . . . 128, 256; 

Jan, Feb, Mar, Apr, . . . , Nov, Dec; 
and so forth. 

In the service station data, you have a sequence of numbers, 1-9, that 
represent the service stations. You could enter the values by hand, but this 
is also an opportunity to use the Autofill feature. 
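
Although Excel does the work for you, it can help to see the arithmetic behind these sequences. The Python sketch below is only an illustration, not Excel’s actual implementation. The function names linear_series and growth_series are made up for this example, and the sketch assumes the pattern is defined by the first two values of the series.

```python
def linear_series(first, second, count):
    """Extend a series by repeatedly adding the step (second - first)."""
    step = second - first
    return [first + i * step for i in range(count)]

def growth_series(first, second, count):
    """Extend a series by repeatedly multiplying by the ratio (second / first)."""
    ratio = second / first
    return [first * ratio ** i for i in range(count)]

# Station numbers 1 through 9, seeded with the values 1 and 2.
print(linear_series(1, 2, 9))

# A geometric sequence seeded with 1 and 2 (values are floats
# because the ratio is computed by division).
print(growth_series(1, 2, 9))
```

The first call mirrors what you are about to do with the fill handle: two starting values establish the step, and the rest of the column follows from it.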

To use Autofill to fill in the rest of the service station numbers: 

1 Select the range A2:A3. 

Notice the small black box at the lower right corner of the double 
border around the selected range. This is called a fill handle. To cre¬ 
ate a simple sequence of numbers, you’ll drag this fill handle over a 
selected range of cells. 

2 Move the mouse pointer over the fill handle until the pointer
changes to a +. Click and hold down the mouse button.

3 Drag the fill handle down to cell A10 and release the mouse button.

Note that as you drag the fill handle down, a screen is displayed 
showing the value that will be placed in the active cell if you release 
the mouse button at that point. 

4 Figure 2-2 shows the service station numbers placed in the cell 
range A2:A10. 


Figure 2-2 
Using 
Autofill to 
insert a 
sequence of 
data values 

drag the fill 
handle down to 
generate a linear 
sequence of 
numbers 
automatically 

EXCEL TIPS

• If you want to create a geometric sequence of numbers, drag the
fill handle with your right mouse button and then select Growth
Trend from the pop-up menu.

• If you want to create a customized sequence of numbers or 
dates, drag the fill handle with your right mouse button and 
select Series... from the pop-up menu. Fill in details about 
your customized series in the Series dialog box. 


With the service station numbers entered, you can add the rest of the sales 
figures to complete the data set. 

To finish entering data: 

1 Select the range B4:C10. 

2 With B4 the active cell, start typing in the remaining values, using 
Table 2-1 as your guide. 

Note that when you’re entering data into a selected range, pressing 
the Tab key at the end of the range moves you to the next row. 

3 Click cell Al to remove the selection. 

The completed worksheet should appear as shown in Figure 2-3. 


Figure 2-3 
The 
completed 
service 
station 
data 

Inserting New Data


Sometimes you will want to add new data to your data set. For example,
suppose you discover that there is a tenth service station with the following
sales data:

Table 2-2 Additional Service Station Sales

Station    Gas       Other
0          $8,995    $6,938


You could simply append this information to the table you’ve already
created, covering the cell range A11:C11. On the other hand, in order to
maintain the sequential order of the station numbers, it might be better to
place this information in the range A2:C2 and then have the other stations
shifted down in the worksheet. You can accomplish this using Excel’s Insert
command.

To insert new data into your worksheet:

1 Select the cell range A2:C2.

2 Right-click the selected range and then click Insert from the pop-up
menu. See Figure 2-4.


Figure 2-4 
Running 
the Insert 
command 
from the 
shortcut 
menu 

3 Verify that the Shift cells down option button is selected.

4 Click OK.

Excel shifts the values in cells A2:C10 down to A3:C11 and inserts a
new blank row in the range A2:C2.

5 Enter the data for Station 0 from Table 2-2 in the cell range A2:C2.

6 Click A1 to make it the active cell.
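
The effect of inserting with Shift cells down can be pictured as inserting a new item at the front of a list. The Python sketch below is an analogy only; the variable name rows is hypothetical, and each inner list stands for one worksheet row (Station, Gas, Other) from Tables 2-1 and 2-2.

```python
# Rows 2 through 10 of the worksheet before the insert (Stations 1-9).
rows = [
    [1, 8415, 7211],
    [2, 8499, 7500],
    [3, 8831, 7899],
    [4, 8587, 7488],
    [5, 8719, 7111],
    [6, 8001, 6281],
    [7, 9567, 13712],
    [8, 9218, 12056],
    [9, 8215, 7508],
]

# Inserting at position 0 pushes every existing row down by one,
# just as the stations move from A2:C10 to A3:C11.
rows.insert(0, [0, 8995, 6938])

for station, gas, other in rows:
    print(station, gas, other)
```

Appending Station 0 at the end would have worked too, but the station numbers would no longer be in sequence.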


Data Formats 

Now that you’ve entered your first data set, you’re ready to work with 
data formats. Data formats are the fonts and styles that Excel applies 
to your data’s appearance. Formats are applied to either text or numbers. 
Excel has already applied a currency format to the sales data you’ve entered. 
For example, if you click cell B2, note that the value in the formula bar is 
8995, but the value displayed in the cell is $8,995. The extra dollar sign and 
comma separator are aspects of the currency format. You can modify this 
format if you wish by inserting additional digits to the value shown in the 
cell (for example, $8,995.00). You may do this if you want dollars and cents 
displayed to the user. 
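
The distinction between a stored value and its displayed format is not unique to Excel. As a rough illustration, the Python sketch below formats the same stored number in two ways. The helper name format_currency is made up for this example and is not an Excel function.

```python
def format_currency(value, decimals=0):
    """Return value as text with a dollar sign and comma separators."""
    return f"${value:,.{decimals}f}"

stored = 8995  # what the formula bar shows

print(format_currency(stored))     # dollars only
print(format_currency(stored, 2))  # dollars and cents
```

In both cases the stored number is unchanged; only the text shown to the reader differs.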

For the text displayed in the range A1:C1, Excel has applied a very basic
format. The text is left justified within its cell and displayed in 11-point
Calibri font (depending on how Excel has been configured on your system,
a different font or font size may be used). You can modify this format
as well.

Try it now, applying a boldface font to the column titles in A1:C1. In
addition, center each column title within its cell.

To apply a boldface font and center the column titles:

1 Select the range A1:C1.

2 Click the Bold button from the Font group on the Home tab.

3 Click the Center button from the Alignment group on the
Home tab.

4 Click cell A1 to remove the selection.

Your data set should look like Figure 2-5. 

Figure 2-5 Applying a boldface font and centering the column titles (callout: column titles are centered within each column and displayed in a bold font)

The Bold and Center buttons on the Home tab give you one-click access
to two of Excel’s popular formatting commands. Other format buttons are
shown in Table 2-3.


Table 2-3 Data Format Buttons

Button              Purpose
Font                Apply the font type.
Font Size           Change the size of the font (in points).
Bold                Apply a boldface font.
Italic              Apply an italic font.
Underline           Underline the selected text.
Align Left          Left-justify the text.
Align Center        Center the text.
Align Right         Right-justify the text.
Percent Style       Display values as percents (i.e., 0.05 = 5%).
Currency Style      Display values as currency (i.e., 5.25 = $5.25).
Comma Style         Add comma separators to values (i.e., 43215 = 43,215).
Increase Decimal    Increase the number of decimal points (i.e., 4.3 = 4.300).
Decrease Decimal    Decrease the number of decimal points (i.e., 4.321 = 4.3).
Fill Color          Change the cell’s background color.
Font Color          Change the color of the selected text.
Merge and Center    Merge the selected cells and center the text across
                    the merged cells.


You can access all of the possible formatting options for a particular cell
by opening the Format Cells dialog box. To see this feature of Excel, you’ll
use it to continue formatting the column titles, changing the font color
to red.

To open the Format Cells dialog box:

1 Select the cell range A1:C1.

2 Right-click the selection and click Format Cells from the shortcut
menu.

The Format Cells dialog box contains six dialog sheets labeled
Number, Alignment, Font, Border, Patterns, and Protection. Each
deals with a specific aspect of the cell’s appearance or behavior in
the workbook. You’ll first change the font color to red. This option is
located in the Font dialog sheet.

3 Click the Font tab.

4 Click the Color drop-down list box and click the Red checkbox
(located as the second entry in the list of standard colors). See
Figure 2-6.

Figure 2-6 Changing the font color to red



5 Click OK to close the Format Cells dialog box. 

6 Click cell D1 to unselect the cells.

Figure 2-7 displays the final format of the column titles.

Figure 2-7 Formatted column titles

Before going further, this would be a good time to save your work.

To save your work:

1 Click the Office button and then click Save.

2 Save the workbook as Gas Sales Data in your student folder. 


Formulas and Functions 

Not all of the values displayed in a workbook come from data entry. Some 
values are calculated using formulas and functions. A formula always begins 
with an equals sign ( = ) followed by a function name, number, text string, 
or cell reference. Most formulas contain mathematical operators such as +
or -. A list of mathematical operators is shown in Table 2-4.

Table 2-4 Mathematical Operators

Operator    Description
+           Addition
-           Subtraction
/           Division
*           Multiplication
^           Exponentiation


Inserting a Simple Formula 

To see how to enter a simple formula, add a new column to your data set 
displaying the total sales from both gasoline and other sources for each of 
the ten service stations. 

To add a formula:

1 Type Total in cell D1 and press Enter.

2 Type =b2+c2 in cell D2 and press Enter.

Note that cells B2 and C2 contain the gas and other sales for
Station 0.

The value displayed in D2 is $15,933, the sum of these two values.

At this point you could enter formulas for the remaining cells in
the data set, but it’s quicker to use Excel’s Autofill capability to add
those formulas for you.

3 Click cell D2 to make it the active cell.

4 Click the fill handle and drag it down to cell D11. Release the mouse
button.

Excel automatically inserts the formulas for the cells in the range
D3:D11. Thus, the formula in cell D11 is =b11+c11, to calculate the
total sales for Station 9. Note that Excel has also applied the same
currency format it used for the values in columns B and C to values
in column D. See Figure 2-8.
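
If you want to check the values that Autofill produces, the same row-by-row addition can be sketched in Python. The list names gas, other, and totals are made up for this illustration; the numbers come from Tables 2-1 and 2-2.

```python
# Columns B and C of the worksheet, Stations 0 through 9.
gas = [8995, 8415, 8499, 8831, 8587, 8719, 8001, 9567, 9218, 8215]
other = [6938, 7211, 7500, 7899, 7488, 7111, 6281, 13712, 12056, 7508]

# Column D: the formula =B+C applied to each row in turn.
totals = [g + o for g, o in zip(gas, other)]

for station, total in enumerate(totals):
    print(f"Station {station}: ${total:,}")
```

Each pass through the loop plays the role of one filled-down formula, pairing the gas and other sales from the same row.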

Figure 2-8 Adding new formulas with Autofill (callout: formulas automatically entered by Excel)



This example illustrates a simple formula involving the addition of two
numbers. What if you wanted to find the total gas and other sales for all ten
of the service stations? In that case, you would be better off using one of
Excel’s built-in functions.


Inserting an Excel Function 

Excel has a library containing hundreds of functions covering most
financial, statistical, and mathematical needs. Users can also create their own
custom functions using Excel’s programming language. StatPlus contains
its own library of functions, supplementing those offered by Excel. A list
of statistics-related functions is included in the Appendix at the end of
this book.

A function is composed of the function name and a list of arguments,
the values required by the function. For example, to calculate the sum of a set of
cells, you would use the SUM function. The general form or syntax of the
SUM function is


= SUM(number1, number2, . . . )

where number1 and number2 are numbers or cell references. Note that the
SUM function allows multiple numbers or cell references. Thus to calculate
the sum of the cells in the range B2:B11, you could enter the formula

= SUM(B2:B11).
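
To see how a function with a variable number of arguments behaves, consider the following Python sketch. It is a loose imitation for illustration only, not Excel’s SUM: the name excel_sum is hypothetical, and a Python list stands in for a cell range.

```python
def excel_sum(*args):
    """Add up any mix of single numbers and lists of numbers."""
    total = 0
    for arg in args:
        if isinstance(arg, (list, tuple)):
            total += sum(arg)  # a "range" contributes all of its cells
        else:
            total += arg       # a single number contributes itself
    return total

# Gasoline sales for Stations 0 through 9 (the range B2:B11).
gas = [8995, 8415, 8499, 8831, 8587, 8719, 8001, 9567, 9218, 8215]

print(excel_sum(gas))           # one range argument
print(excel_sum(1, 2, [3, 4]))  # numbers and a range mixed together
```

The first call corresponds to the formula with a single range argument; the second shows that numbers and ranges can be combined in one call.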

Although you can type in functions directly, you may find it easier to use
the commands located in the Function Library group on the Formulas tab.
These commands provide information on the parameters required for
calculating the function value as well as giving one-click access to online help
regarding each function.

To calculate total sales figures for all ten service stations:

1 Type Total in cell A13 and press Tab.

2 Click the Math & Trig button located in the Function Library group
on the Formulas tab.

Excel displays a scroll box listing all of the Excel functions related
to mathematics and trigonometry.

3 Scroll down the scroll box and click SUM from the list as shown in
Figure 2-9.


Figure 2-9 
Accessing 
Math & Trig 
functions 



list of Math & Trig 
functions 


Next, Excel displays a dialog box with the arguments for the SUM
function. Excel has already inserted the cell reference B2:B12 for
you in the first argument, but if this is not the reference you want,
you can select a different one yourself. Try this now.

4 Click the Collapse Dialog button next to the number1 argument.

5 Drag your mouse pointer over the range B2:B11.

6 Click the Restore Dialog button.

The cell range is entered into the number1 argument. See
Figure 2-10.

Figure 2-10 Adding arguments to the SUM function (callout: cells used to calculate gasoline sales)



7 Click OK. 


The total gasoline sales value of $87,047 is now displayed in cell B13. You
can easily add total sales for other individual products and for all products
together using the same Autofill technique used earlier.

To add the remaining total sales calculations:

1 Select B13.

2 Click the fill handle and drag it to cell D13.

3 Release the mouse button.

Total sales figures are now shown in the range B13:D13. See
Figure 2-11.

Figure 2-11 All sales totals


Cell References

When Excel calculated the total sales for column C and column D on 
your worksheet, it inserted the following formulas into C13 and D13, 
respectively: 

=SUM(C2:C11)

and

=SUM(D2:D11)

At this point you may wonder how Excel knew to copy everything except
the cell reference from cell B13 and, in place of the original B2:B11
reference, to shift the cell reference one and two columns to the right. Excel does
this automatically when you use relative references in your formulas. A
relative reference identifies a cell range on the basis of its position relative to
the cell containing the formula. One advantage of using relative references,
as you’ve seen, is that you can fill up a row or column with a formula and
the cell references in the new formulas will shift along with the cell.

Now what if you didn’t want Excel to shift the cell reference when you 
copied the formula into other cells? What if you wanted the formula always 
to point to a specific cell in your worksheet? In that case you would need 
an absolute reference. In an absolute reference, the cell reference is prefixed 
with dollar signs. For example, the formula

=SUM($C$2:$C$11)

is an absolute reference to the range C2:C11. If you copied this formula into
other cells, it would still point to C2:C11 and would not be shifted.

You can also create formulas that use mixed references, combining both 
absolute and relative references. For example, the formulas 

=SUM($C2:$C11) 

and 


=SUM(C$2:C$11)

use mixed references. In the first example, the column is absolute but
the row is relative, and in the second example, the column is relative
but the row is absolute. This means that in the first example, Excel will shift
the row references but not the column references, and in the second
example, Excel will shift the column references but not the row references. You
can learn more about reference types and how to use them in Excel’s online
Help. In most situations in this book, you’ll use relative references, unless
otherwise noted.
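
If it helps to see the shifting rule spelled out, here is a minimal sketch in Python. It is not how Excel is implemented; the helper names (shift_formula and the two column converters) are hypothetical, and the sketch assumes simple A1-style references and a copy to the right or downward. A dollar sign in front of a column letter or row number locks that part of the reference; everything else moves with the formula.

```python
import re

# Matches A1-style references such as B2, $C$11, $C2, or C$2.
# Simplification (assumption): a function name ending in digits, such as
# LOG10, would also match, so this is only meant for formulas like SUM(B2:B11).
REFERENCE = re.compile(r"(\$?)([A-Z]+)(\$?)(\d+)")


def column_to_number(letters):
    """Convert column letters to a 1-based number (A -> 1, Z -> 26, AA -> 27)."""
    number = 0
    for ch in letters:
        number = number * 26 + (ord(ch) - ord("A") + 1)
    return number


def number_to_column(number):
    """Convert a 1-based column number back to letters."""
    letters = ""
    while number > 0:
        number, remainder = divmod(number - 1, 26)
        letters = chr(ord("A") + remainder) + letters
    return letters


def shift_formula(formula, columns_right, rows_down):
    """Return the formula as it would read after being copied to a cell
    columns_right columns over and rows_down rows down."""
    def shift(match):
        col_lock, col, row_lock, row = match.groups()
        if not col_lock:  # relative column: moves with the formula
            col = number_to_column(column_to_number(col) + columns_right)
        if not row_lock:  # relative row: moves with the formula
            row = str(int(row) + rows_down)
        return f"{col_lock}{col}{row_lock}{row}"
    return REFERENCE.sub(shift, formula)


print(shift_formula("=SUM(B2:B11)", 1, 0))      # =SUM(C2:C11)
print(shift_formula("=SUM($C$2:$C$11)", 2, 3))  # =SUM($C$2:$C$11)
print(shift_formula("=SUM($C2:$C11)", 1, 1))    # =SUM($C3:$C12)
print(shift_formula("=SUM(C$2:C$11)", 1, 1))    # =SUM(D$2:D$11)
```

The first call mirrors what happened when you filled B13 into C13; the remaining calls show the absolute and mixed forms described above.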


Range Names 

Another way of referencing a cell in your workbook is with a range name.
Range names are names given to specific cells or cell ranges. For
example, you can define the range name Gas to refer to cells B2:B11 in
your worksheet. To calculate the total gasoline sales, you could use the
formula

=SUM(B2:B11)

or

=SUM(Gas).

Range names have the advantage of making your formulas easier to
write and interpret. Without range names you would have to know
something about the worksheet before you could determine what the formula
=SUM(B2:B11) calculates.

Excel provides several tools to create range names. You’ll find it easier to 
perform data analysis on your data set if you’ve defined range names for all 
of the columns. A simple way to create range names is to select the range 
of data including a row or column of titles. You can then use the titles from 
the worksheet to define the range name. Try it now with the service station
data.

Figure 2-12 Creating range names from a selection


To create range names for the service station data: 

1 Select the range A1:D11.

2 Click the Create from Selection button located in the Defined Names
group on the Formulas tab.

Excel will create range names based on where you have entered the 
data labels. In this case, you’ll use the labels you entered in the top 
row as the basis for the range names. 

3 Verify that the Top row checkbox is selected as shown in 
Figure 2-12. 



4 Click OK. 


Four range names have been created for you: Station, Gas, Other,
and Total. You can use Excel’s Name Box to select those ranges
automatically.

To select the Total range:

1 Click the Name Box (the drop-down list box) located directly above
and to the left of the worksheet’s row and column headers.

2 Click Total from the Name Box.

The cell range D2:D11 is automatically selected. See Figure 2-13.

All of the workbooks you’ll use in this book will contain range names for
each of their data columns.
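
The idea behind Create from Selection can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name names_from_top_row is hypothetical, and the numbers are made up for the example rather than taken from the service station workbook. Each label in the top row becomes a name that refers to the values beneath it.

```python
# Hypothetical miniature worksheet; the values are invented for illustration.
table = [
    ["Station", "Gas", "Other", "Total"],
    [1, 100, 40, 140],
    [2, 250, 60, 310],
    [3, 175, 25, 200],
]


def names_from_top_row(rows):
    """Map each label in the first row to the list of values below it."""
    header, *data = rows
    return {name: [row[i] for row in data] for i, name in enumerate(header)}


names = names_from_top_row(table)

# Plays the role of =SUM(Gas): the name stands in for the cell range.
total_gas = sum(names["Gas"])
print(total_gas)  # 525
```

As with =SUM(Gas), the reader of sum(names["Gas"]) does not need to know where the gasoline figures sit in the sheet.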

EXCEL TIPS

• Another way to create range names is by first selecting the cell
range and then typing the range name directly into the Name Box.

• You can view and organize all of the range names in the current
workbook by clicking the Name Manager button located in the
Defined Names group on the Formulas tab.

• Range names have a property called scope that determines
where they are recognized in the workbook. Scope can be limited
to the current worksheet only, allowing you to duplicate
the same range name on different worksheets. If you wish to
reference a range name with a scope limited to a particular
worksheet, you’ll have to specify which worksheet you want to
use. For example, you must use the reference 'Sheet 1'!Gas for
the cell range “Gas” located on the Sheet 1 worksheet.

• You can replace cell references with their range names by clicking
the Define Name list button from the Defined Names group on
the Formulas tab and then selecting Apply Names from the list
box. You will then be prompted to apply the already-defined
range names to formulas in the workbook.


Sorting Data

Once you’ve entered your data into Excel, you’re ready to start analyzing
it. One of the simplest analyses is to determine the range of the data values.
Which values are largest? Which are smallest? To answer questions of this
type, you can use Excel to sort the data. For example, you can sort the gas
station data in descending order, displaying first the station that has shown
the greatest total revenue down through the station that has had the lowest
revenue. Try this now with the data you’ve entered.

To sort the data by Total amount: 

1 Select the cell range A1:D11.

The range A1:D11 contains the range you want to sort. Note that you
do not include the cells in the range A13:D13, because these are the
column totals and not individual service stations.

2 Click the Sort & Filter button located in the Editing group on the
Home tab and then click Custom Sort.

Excel opens the Sort dialog box. From this dialog box you can select
multiple sorting levels. You can sort each level in an ascending or
descending order.

3 Click the Sort by list box and select Total from the list of range names
found in the selected worksheet cells.

4 Click Largest to Smallest from the Order list box as shown in Figure 2-14.

5 Click the OK button.

6 Deselect the cell range by clicking cell A1.

The stations are now sorted in order from Station 7, showing the
largest total revenue, to Station 6 with the lowest revenue. See
Figure 2-15.


Figure 2-15 The gas station data sorted in descending order of Total revenue


EXCEL TIPS

• If you want to sort your list in nonnumeric order (in terms of
days of the week or months of the year), click Custom List from
the Order list box and then select one of the custom lists defined
for your workbook.

• To sort your data by multiple levels, click the Add Level button
in the Sort dialog box and then specify the data values
corresponding to the next level.
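
For readers who want to see the sorting logic outside the Sort dialog box, the following Python sketch shows a single-level descending sort and a two-level sort. The station figures are invented for illustration (two stations are given the same Total on purpose), and the second sort level, Gas, is simply an assumed tie-breaker.

```python
# Invented figures; Stations 1 and 4 tie on Total to show the second level.
stations = [
    {"Station": 1, "Gas": 100, "Other": 40},
    {"Station": 2, "Gas": 250, "Other": 60},
    {"Station": 3, "Gas": 175, "Other": 25},
    {"Station": 4, "Gas": 120, "Other": 20},
]
for row in stations:
    row["Total"] = row["Gas"] + row["Other"]

# One level: Total, largest to smallest. Ties keep their original order.
by_total = sorted(stations, key=lambda row: row["Total"], reverse=True)

# Two levels: Total largest to smallest, then Gas largest to smallest.
by_total_then_gas = sorted(stations, key=lambda row: (-row["Total"], -row["Gas"]))

print([row["Station"] for row in by_total])           # [2, 3, 1, 4]
print([row["Station"] for row in by_total_then_gas])  # [2, 3, 4, 1]
```

Adding a level changes only the order of rows that tie on the first level, which is the same effect the Add Level button has.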


Querying Data 

In some cases you may be interested in a subset of your data rather than in
the complete list. For instance, a manufacturing company trying to analyze
quality control data from three work shifts might be interested in looking
only at the night shift. A firm interested in salary data might want to
consider just the subset of those making between $55,000 and $85,000. Excel
allows you to specify the criteria for creating these subsets in the following
two ways:

• Comparison criteria, which compare data values to specified values or
constants

• Calculated criteria, which compare data values to a calculated value

For the gas station data, an example of a comparison criterion would be one
that determines which service stations have gas sales exceeding $5,000. On
the other hand, a calculated criterion would be one that determines which
service stations have gas sales that exceed the average of gas sales of all
stations in the data sample.

Once you have determined your criteria for creating a subset of the data
values, you select worksheet cells that fulfill these criteria by filtering
or querying the data. Excel provides two ways of filtering data. The first
method, called the AutoFilter, is primarily used for simple queries
employing comparison criteria. For more complicated queries and those involving
calculated values, Excel provides the Advanced Filter. You’ll have a chance
to use both methods in exploring the gas station data.
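
The difference between the two kinds of criteria can be sketched in Python as follows. The sales figures are made up for the example; only the $5,000 cutoff comes from the text. A comparison criterion tests each value against a fixed constant, whereas a calculated criterion tests it against a number that is recomputed from the data.

```python
from statistics import mean

# Station number -> gas sales; invented values for illustration only.
gas_sales = {1: 4200, 2: 5200, 3: 7100, 4: 4900}

# Comparison criterion: gas sales exceeding a fixed constant.
threshold = 5000
comparison = [station for station, gas in gas_sales.items() if gas > threshold]

# Calculated criterion: gas sales exceeding the average of all stations.
average = mean(gas_sales.values())
calculated = [station for station, gas in gas_sales.items() if gas > average]

print(comparison)  # [2, 3]
print(average)     # 5350
print(calculated)  # [3]
```

If the sales figures change, the comparison result changes only when a value crosses 5000, but the calculated result can change for every station because the average itself moves.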


Using the AutoFilter 

Let’s say the service station company plans a massive advertising campaign 
to boost sales for the service stations that are reporting gas sales of less than 
$8,500. You can construct a simple query using comparison criteria to have 
Excel display only service stations with gas sales <$8,500. 

To query the service station list: 

1 Click the Sort & Filter button from the Editing group on the Home 
tab and then click Filter from the drop-down menu. 

Excel adds drop-down arrows to each of the column titles in the 
data list. By clicking these drop-down arrows you can filter the data 
list on the basis of the values in the selected column. 

2 Click the Gas drop-down arrow to display the shortcut menu. Click
Number Filters and then Less Than as shown in Figure 2-16.

Figure 2-16 Filtering data values



Excel opens the Custom AutoFilter dialog box. From this dialog box 
you can specify the criteria used to filter the values in the data list. 

3 Type 8500 in the input box as shown in Figure 2-17.

Figure 2-17 Creating a filter for gas sales less than $8,500

4 Click OK.

Excel modifies the list of service stations to show stations 1, 2, 6,
and 9. See Figure 2-18.

Figure 2-18 Stations with gas sales < $8,500

The service station data for the other stations has not been lost, merely
hidden. You can retrieve the data by choosing the All option from the Gas
drop-down list.

Let’s say that you need to add a second filter that further narrows the list
to those service stations selling $7,500 or less worth of other products. This
filter does not negate the one you just created; it adds to it.

To add a second filter: 

1 Click the Other drop-down filter arrow, click Number Filters and 
then click Less Than Or Equal to. 

2 Type 7500 in the Custom AutoFilter dialog box.

3 Click OK.

Excel reduces the number of displayed stations to Stations 1, 2, and 6.


Stations 1, 2, and 6 are the only stations that have <$8,500 in gasoline 
sales and <= $7,500 in other sales. Combining filters in this way is known 
as an And condition because only stations that fulfill both criteria are 
displayed. 

You can also create filters using Or conditions in which only one of the 
criteria must be true. 

To remove the AutoFilter from your data set, you can either stop running
the AutoFilter or remove each filter individually. Try both methods now.

To remove the filters:

1 Click the Other drop-down filter arrow and click Clear Filter from
“Other”.

The second filter is removed, and now only the results of the first
filter are displayed.

2 Click the Sort & Filter button from the Editing group on the Home
tab and then click Filter from the drop-down menu.

Excel stops running AutoFilter altogether and removes the filter
drop-down arrows from the worksheet.


Using the Advanced Filter 

There might be situations where you want to use more complicated criteria
to filter your data. Such situations include criteria that

• Require several And/Or conditions

• Involve formulas and functions

Such cases are often beyond the capability of Excel’s AutoFilter, but you
can still do them using the Advanced Filter. To use the Advanced Filter,
you must first enter your selection criteria into cells on the worksheet.
Once those criteria are entered, you can use them in the Advanced Filter
command.

Try this technique by recreating the pair of criteria you just entered; only
now you’ll use Excel’s Advanced Filter.

To create a query for use with the Advanced Filter: 

1 Click cell B15, type Advanced Filter Criteria, and press Enter. 

2 Type Gas in cell B16 and press Tab. Type Other in cell C16 and
press Enter.

3 Type < 8500 in cell B17 and press Tab. Type <= 7500 in cell C17
and press Enter.


If two criteria occupy the same row in the worksheet, Excel assumes that
an And condition exists between them. In the example you just typed in,
both criteria were entered into row 17, and Excel assumed that you wanted
gas sales < $8,500 and other sales <= $7,500. Thus, these criteria match
what you created earlier using the AutoFilter. Now apply these criteria to
the service station data. To do this, open the Advanced Filter dialog box and
specify both the range of the data you want filtered and the range containing
the filter criteria.
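
One way to picture how a criteria range is read is the Python sketch below. It is a simplified model, not Excel’s implementation: the function names (parse_criterion, advanced_filter) are hypothetical, the station figures are invented, and only numeric comparisons are handled. Criteria that share a row must all hold (an And condition); criteria placed on separate rows are alternatives (an Or condition), and a blank cell places no restriction.

```python
import operator
import re

OPERATORS = {
    "<=": operator.le, ">=": operator.ge, "<>": operator.ne,
    "<": operator.lt, ">": operator.gt, "=": operator.eq,
}


def parse_criterion(text):
    """Turn text such as '< 8500' into a one-argument test function."""
    match = re.fullmatch(r"\s*(<=|>=|<>|<|>|=)\s*(-?\d+(?:\.\d+)?)\s*", text)
    if match is None:
        raise ValueError(f"unsupported criterion: {text!r}")
    compare = OPERATORS[match.group(1)]
    limit = float(match.group(2))
    return lambda value: compare(value, limit)


def advanced_filter(records, criteria_header, criteria_rows):
    """Keep records that satisfy every criterion in at least one criteria row."""
    tests = []
    for row in criteria_rows:
        # Blank cells are skipped: they place no restriction on that field.
        tests.append([(field, parse_criterion(cell))
                      for field, cell in zip(criteria_header, row) if cell.strip()])

    def keep(record):
        return any(all(test(record[field]) for field, test in row_tests)
                   for row_tests in tests)

    return [record for record in records if keep(record)]


# Invented station data for illustration only.
records = [
    {"Station": 1, "Gas": 8000, "Other": 7000},
    {"Station": 2, "Gas": 8200, "Other": 9000},
    {"Station": 3, "Gas": 9100, "Other": 7400},
    {"Station": 4, "Gas": 9500, "Other": 9900},
]
header = ["Gas", "Other"]

# Both criteria in one row: And condition.
both = advanced_filter(records, header, [["< 8500", "<= 7500"]])
# Criteria in different rows: Or condition.
either = advanced_filter(records, header, [["< 8500", ""], ["", "<= 7500"]])

print([r["Station"] for r in both])    # [1]
print([r["Station"] for r in either])  # [1, 2, 3]
```

The header list plays the part of the labels typed in row 16, and each inner list plays the part of one criteria row beneath them.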


To run the Advanced Filter command: 

1 Select the cell range A1:D11.

2 Click the Advanced button from the Sort & Filter group on the Data
tab. Excel opens the Advanced Filter dialog box.

3 Make sure that the Filter the list, in-place option button is selected
and that $A$1:$D$11 is displayed in the List range box.

4 Enter B16:C17 in the Criteria range box. This is the cell range
containing the filter criteria you just typed in. See Figure 2-19.


Figure 2-19 The Advanced Filter dialog box
[callouts: range of data values; range containing the filter criteria]


5 Click OK. 


As before, only Stations 1, 2, and 6 are displayed. Note that the column
totals displayed in row 13 are not adjusted for the hidden values. You have
to be careful when filtering data in Excel because formulas will still be based
on the entire data set, including hidden values.
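
A small Python sketch makes the warning concrete. The figures and the "visible" flag are invented for illustration; the point is only that a total computed over every row differs from a total computed over the rows left showing after a filter.

```python
# Invented rows; "visible" marks whether a row survives the filter.
rows = [
    {"Station": 1, "Gas": 8000, "visible": True},
    {"Station": 3, "Gas": 9100, "visible": False},  # hidden by the filter
    {"Station": 2, "Gas": 8200, "visible": True},
]

# What an ordinary total reports: every row, hidden or not.
total_all = sum(row["Gas"] for row in rows)

# What you might expect from looking at the filtered list.
total_visible = sum(row["Gas"] for row in rows if row["visible"])

print(total_all)      # 25300
print(total_visible)  # 16200
```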

What if you wanted to look at only those service stations with either
gasoline sales < $8,500 or other sales <= $7,500? Entering an Or condition
between two different columns in your data set is not possible with the
AutoFilter, but you can do it with the Advanced Filter. You do this by
placing the different criteria in different rows in the worksheet.

To create an Or condition with the Advanced Filter:

1 Delete the criteria in cell C17.

2 Enter the criterion <= 7500 in cell C18.

3 Once again, click the Advanced button from the Sort & Filter group
on the Data tab to open the Advanced Filter dialog box.

4 Enter the cell range A1:D11 into the List range box.

5 Change the cell reference in the Criteria range box to B16:C18 to
reflect the changes you made to the criteria.

6 Click OK.

Excel now displays Stations 0, 1, 2, 4, 5, 6, and 9. Each station has
gas sales < $8,500 or sales of other items <= $7,500. See Figure 2-20.


Figure 2-20 Filtering data using an Or condition
[callout: criteria specifying gas sales < $8500 or Other sales <= $7500]

7 To view all stations again, click the Clear button from the Sort &
Filter group on the Data tab.


Using Calculated Values

You decide to reward service station managers whose daily gasoline sales
were higher than average. How would you determine which service stations
qualified? You could calculate average gasoline sales and enter this number
explicitly into a filter (either an AutoFilter or an Advanced Filter). One
problem with this approach, however, is that every time you update your service
station data, you have to recalculate this number and rewrite the query.

However, Excel’s AutoFilter allows you to include this information in
your query automatically.


Figure 2-21 Filtered data for higher-than-average gas sales


To select stations with higher-than-average gas sales:

1 Select the cell range A1:D11 again.

2 Click the Filter button from the Sort & Filter group on the Data tab to
display the AutoFilter drop-down arrows.

3 Click the Gas drop-down list arrow, click Number Filters, and then
click Above Average.

As shown in Figure 2-21, the data list is filtered again, showing only
the data from Service Stations 0, 3, 5, 7, and 8. Those are the five service
stations whose daily gas sales are higher than the average from all
stations in the data list.

4 Click the Filter button again from the Sort & Filter group on the Data
tab to turn off the filter of the service station data.

For more complicated formulas, you can enter the expressions using an
Advanced Filter.

You’ve completed your analysis of the service station data. Save and close
the workbook now.

To finish your work:

1 Click the Office button and then click Save.

2 Click the Office button again and then click Close.


Importing Data from Text Files 

Often your data will be created using applications other than Excel. In that
case, you’ll want to go through a process of bringing that data into Excel
called importing. Excel provides many tools for importing data. In this
chapter you’ll explore two of the more common sources of external data:
text files and databases.

A text file contains only text and numbers, without any of the formulas,
graphics, special fonts, or formatted text that you would find in a workbook.
Text files are one of the simplest and most widely used methods of
storing data, and most software programs can both save and retrieve data in a
text file format. Thus, although text files contain only raw, unformatted data,
they are very useful in situations where you want to share data with others.

Because a text file doesn’t contain formatting codes to give it structure,
there must be some other way of making it understandable to a program that
will read it. If a text file contains only numbers, how will the importing
program know where one column of numbers ends and another begins? When
you import or create a text file, you have to know how the values are
organized within the file. One way to structure text files is to use a delimiter,
which is a symbol, usually a space, a comma, or a tab, that separates one
column of data from another. The delimiter tells a program that retrieves the
text file where columns begin and end. Text that is separated by delimiters
is called delimited text.

In addition to delimited text, you can also organize data with a fixed-width
file. In a fixed-width text file, each column will start at the same
location in the file. For example, the first column will start at the first space in
the file, the second column will start at the tenth space, and so forth.
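
To make the two layouts concrete, here is a brief Python sketch that reads the same three lines once as comma-delimited text and once as fixed-width text. The sample text, the column widths (10, 10, and 5 characters), and the helper name split_fixed are assumptions made for the illustration; they do not describe the layout of any file supplied with this book.

```python
import csv
import io

# The same data stored two ways (values chosen for illustration).
delimited_text = (
    "Brand,Food,Price\n"
    "SNYDER,PRETZEL,2.19\n"
    "LENDERS,BAGEL,1.39\n"
)
fixed_text = (
    "Brand     Food      Price\n"
    "SNYDER    PRETZEL   2.19 \n"
    "LENDERS   BAGEL     1.39 \n"
)

# Delimited: the comma marks where one column ends and the next begins.
delimited_rows = list(csv.reader(io.StringIO(delimited_text)))

# Fixed width: every column starts at a known position, so the reader
# must be told the widths in advance (assumed here to be 10, 10, and 5).
widths = [10, 10, 5]


def split_fixed(line, widths):
    """Cut a line into fields of the given widths and trim the padding."""
    fields = []
    start = 0
    for width in widths:
        fields.append(line[start:start + width].strip())
        start += width
    return fields


fixed_rows = [split_fixed(line, widths) for line in fixed_text.splitlines()]

print(delimited_rows[1])  # ['SNYDER', 'PRETZEL', '2.19']
print(fixed_rows[1])      # ['SNYDER', 'PRETZEL', '2.19']
```

Both readers recover the same fields; what differs is the extra knowledge each one needs, a delimiter in the first case and a set of column positions in the second.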

When Excel starts to open a text file, it automatically starts the Text
Import Wizard to determine whether the contents are organized in a
fixed-width format or a delimited format and, if it’s delimited, what delimiter is
used. If necessary, you can also intervene and tell it how to interpret the
text file.

Having seen some of the issues involved in using a text file, you are ready
to try importing data from a text file. In this example, a family-owned bagel
shop has gathered data on wheat products that people eat as snacks or for
breakfast. The family members intend to compare these products with the
products that they sell. The data have been stored in a text file, Wheat.txt,
shown in Table 2-5. The file was obtained from the nutritional information
on the packages of the competing wheat products.


Table 2-5 Wheat Data

Brand               Food          Price  Total Oz.  Serving Grams  Calories  Protein  Carbo  Fiber  Sugar  Fat
SNYDER              PRETZEL        2.19        9.0           31.0       120        3     25      1      1  1.0
LENDERS             BAGEL          1.39       17.1           81.0       210        7     43      2      3  1.5
BAYS                ENG MUFFIN     2.27       12.0           57.0       140        5     27      1      2  1.5
THOMAS              ENG MUFFIN     2.94       12.0           57.0       120        4     25      1      1  1.0
QUAKER OAT SQUARES  CEREAL         5.49       24.0           57.0       210        6     44      5     10  2.5
NABISCO             GRAH CRACKER   3.17       14.4           31.0       130        2     24      1      7  3.0
WHEATIES            CEREAL         5.09       15.6           27.0       100        3     22      3      4  1.5
WONDER              BREAD          0.99       20.0           26.0        60        2     13      0      2  1.5
BROWNBERRY          BREAD          3.49       24.0           43.0       120        4     23      1      3  2.0
PEPPERIDGE          BREAD          2.89       16.0           25.5        70        2     13      1      2  1.0


To start importing Wheat.txt into an Excel workbook: 

1 Click the Office button and then click Open.

2 Navigate to the Chapter02 data folder and change the file type to
Text Files (*.prn; *.txt; *.csv).

Excel displays the file wheat.txt from the list of text files in the
Chapter02 folder.

3 Double-click the wheat.txt file.

Excel displays the Text Import Wizard to help you select the text to
import. See Figure 2-22.

Figure 2-22 Text Import Wizard Step 1 of 3


The wizard has automatically determined that the wheat.txt file is
organized as a fixed-width text file. By moving the horizontal and vertical scroll
bars, you can see the whole data set. Once you’ve started the Text Import
Wizard, you can define where various data columns begin and end. You can
also have the wizard skip entire columns.

To define the columns you intend to import: 

1 Click the Next button. 

The wizard has already placed borders between the various columns
in the text file. You can remove a border by double-clicking it, you
can add a border by clicking a blank space in the Data Preview
window, or you can move a border by dragging it to a new location. Try
moving a border now.

2 Click and drag the right border for TotalOZ further to the right so
that it aligns with the left edge of the ServingGrams column. See
Figure 2-23.



3 Click the Next button. 

The third step of the wizard allows you to define column formats 
and to exclude specific columns from your import. By default, the 
wizard applies the General format to your data, which will work in 
most cases. 

4 Click the Finish button to close the wizard. 

Excel imports the wheat data and places it into a new workbook.
See Figure 2-24.

Figure 2-24 Wheat data imported into Excel

Notice that the data for the first two columns appear to be cut off, but
don’t worry. When Excel imports a file, it formats the new workbook with a
standard column width of about nine characters, regardless of column
content. The data are still there but are hidden.

To format the column widths to show all the data: 

1 Press CTRL+A twice to select all cells in the worksheet.

2 Move the mouse pointer to the border of one of the column headers
until the pointer changes to a double-headed arrow and double-click
the column border.

Excel changes the column widths to match the width of the longest
cell in each column.

3 Click cell A1 to remove the selection.

Save and close the workbook.

4 Click the Office button and then click Save As.

5 Type Wheat Data in the File Name box and select Excel Workbook
(*.xlsx) from the Save As Type list box.

6 Click the Save button.

7 Click the Office button and then click Close to close the workbook.


Importing Data from Databases 

Excel allows the user to create connections to a variety of data sources. 
You’ve already seen how to create a connection to a text file; now you’ll 
learn how to create a connection to a database file. 

A database is a program that stores and retrieves large amounts of data 
and creates reports describing that data. Excel can retrieve data stored in 
most database programs, including Microsoft® Access, Borland dBASE®, 
Borland Paradox®, and Microsoft FoxPro®. 

Databases store information in tables, organized in rows and columns,
much like a worksheet. Each column of the table, called a field, stores
information about a specific characteristic of a person, place, or thing. Each row,
called a record, displays the collection of characteristics of a particular
person, place, or thing. A database can contain several such tables; therefore, you
need some way of relating information in one table to information in another.
You relate tables to one another by using common fields, which are the fields
that are the same in each table. When you want to retrieve information from
two tables linked by a common field, Excel matches the value of the field in
one table with the same value of the field in the second table. Because the field
values match, a new table is created containing records from both tables.
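
The matching step can be sketched in Python with two small tables. Everything here is illustrative: the table contents, the ID numbers, and the function name join_on are assumptions, although the brand, price, and calorie figures echo rows of Table 2-5. Records from the two tables are paired whenever they hold the same value in the common field.

```python
# Two hypothetical tables that share the common field "Product ID".
product_table = [
    {"Product ID": 1, "Brand": "SNYDER", "Food": "PRETZEL", "Price": 2.19},
    {"Product ID": 2, "Brand": "LENDERS", "Food": "BAGEL", "Price": 1.39},
]
nutrition_table = [
    {"Product ID": 2, "Calories": 210, "Protein": 7},
    {"Product ID": 1, "Calories": 120, "Protein": 3},
]


def join_on(left, right, field):
    """Pair each left record with the right record sharing the same field value."""
    index = {record[field]: record for record in right}
    joined = []
    for record in left:
        match = index.get(record[field])
        if match is not None:
            # Merge the two records into one row of the new table.
            joined.append({**record, **match})
    return joined


combined = join_on(product_table, nutrition_table, "Product ID")
for row in combined:
    print(row["Brand"], row["Price"], row["Calories"])
```

Note that the order of records in the second table does not matter; only the value of the common field determines which records are combined.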

A large database can have many tables, and each table can have several
fields and thousands of records, so you need a way to choose only the
information that you most want to see. When you want to look only at specific
information from a database, you create a database query. A database query is
a question you ask about the data in the database. In response to your query,
the database finds the records and fields that meet the requirements of your
question and then extracts only that data. When you query a database, you
might want to extract only selected records. In this case, your query would
contain criteria similar to the criteria you used earlier in selecting data from
an Excel workbook.


Using Excel’s Database Query Wizard

You can import data from a database file directly, as you did with the wheat
text file. You can also write a query to retrieve only portions of data from
selected tables within the database file.

To see how this works, you’ll import another data set containing
nutritional data located in an Access database file named wheat.mdb. The database
contains two tables: Product, a table containing descriptive information about
each product (the name, manufacturer, serving size, price, and so on), and
Nutrition, a table of nutritional information (calories, proteins, etc.). You’ll
import the data by creating a connection to the database file using Microsoft
Query, a small application installed with most Office 2007 products.

To access Microsoft Query: 

1 Click the Office button and then click New to open a new blank 
worksheet in Excel. 

2 Click the From Other Sources button from the Get External Data 
group on the Data tab and then click From Microsoft Query. 

3 Verify that the Databases dialog sheet tab is selected. 

At this point, you’ll choose a data source. Excel provides several 
choices from such possible sources as Access, dBase, FoxPro, and 
other Excel workbooks. You can also create your own customized 
data source. In this case, you’ll use the Access data source because 
this data comes from an Access database. 

4 Click MS Access Database* from the list of data sources in the
Databases dialog sheet and click the OK button.

5 Navigate to the folder containing your Chapter02 data files and
select the wheat.mdb database file. Click the OK button.

Excel opens the Query Wizard dialog box shown in Figure 2-25.

Now that you’ve started the Query Wizard, you are free to select the
various fields that you’ll import into Excel. The box on the left of the wizard
shown in Figure 2-25 shows the tables in the database. As you expected,
there are two: Nutrition and Product. By clicking the plus box in front of
each table name, you can view and select the specific fields that you’ll
import into Excel. Try this now by selecting fields from both tables.


To select fields for your query:

1 Click the plus box [+] in front of the Product table name.

A space opens beneath the table name displaying the names of each
of the fields in the table.

2 Double-click the following names in the list:

Brand
Food
Price
Package oz
Serving oz

As you double-click each field name, the name appears in the box
on the right, indicating that it is part of your selection in the
query. Note that you do not select the Product ID field. This field is
the common field between the two tables and contains a unique ID
number for each wheat product. You don’t have to include this in
your query.

3 Click the plus box [+] in front of the Nutrition table name and then
double-click the following field names:

Calories
Protein
Carbohydrates
Fat

Once again, you do not select the common field, Product ID. Your
dialog box should appear as shown in Figure 2-26.

4 Click the Next button.


After specifying which fields you’ll import, you’ll now have the
opportunity to control which records to import and how your data will be sorted.


Specifying Criteria and Sorting Data 

You can apply criteria to the data you import with the Query Wizard. At this
point, your query will import all of the records from the Wheat database,
but you can modify that. Say you want to import only those wheat products
whose price is $1.25 or greater. You can do that at this point in the wizard.
You can specify several levels of And/Or conditions for each of the many
fields in your query.

To add criteria to your query: 

1 Click Price from the list of columns to filter in the box at the left of 
the Query Filter dialog box. 

2 In the highlighted drop-down list box at the right, click is greater 
than or equal to. 

3 Type 1.25 in the drop-down list box to the immediate right. See
Figure 2-27.

Figure 2-27 Selecting only those records whose price >= $1.25

This criterion selects only those records whose price is greater than
or equal to $1.25.

4 Click the Next button.


The last step in defining your query is to add any sorting options. You can
specify up to three different fields to sort by. In this example, you decide to
sort the wheat products by the number of calories they contain, starting with
the highest-calorie product first and going down to the lowest.


To specify a sort order:

1 Select Calories from the Sort by list box.

2 Click the Descending option button. See Figure 2-28.

Figure 2-28 Sorting records in descending order of calories

3 Click the Next button.


The last step in the Query Wizard is to choose where you want to send
the data. You can

1. Import the data into your Excel workbook. 

2. Open the results of your query in Microsoft Query. Microsoft Query is a
program included on your installation disk with several tools that allow
you to create even more complex queries.

You can learn more about these two options in Excel’s online Help. In this
example, you’ll simply retrieve the data into your Excel workbook.

To finish retrieving the data: 

1 Click the Return Data to Microsoft Excel option button. 

2 Click the Finish button. 

Yon can now specify where the data will be placed. The default will 
be to place the data in the active cell of the current worksheet. In 
this case, that is cell Al. Accept this default. 

3 Click the OK button. 

Excel connects to the Wheat database and retrieves the data shown
in Figure 2-29. Note that these include only those wheat products
whose price is $1.25 or greater and that the data are sorted in
descending order of calories. Also note that Excel has automatically
formatted the data values in a table and added AutoFilter buttons to
filter the data if you so desire.
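
The Query Wizard steps above amount to a filter followed by a sort. As a
rough sketch only, the same selection can be written as a SQL query. The
example uses Python's built-in sqlite3 module with an in-memory table; the
table name, the column names, and the three product rows are hypothetical
stand-ins and are not the contents of the Wheat database.

```python
import sqlite3

# Hypothetical stand-in for the Wheat database: the table name, column
# names, and rows are invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (brand TEXT, price REAL, calories INTEGER)")
conn.executemany(
    "INSERT INTO product VALUES (?, ?, ?)",
    [("Brand A", 1.10, 90), ("Brand B", 1.25, 70), ("Brand C", 2.40, 110)],
)

# Same criterion and sort order as the wizard steps:
# price >= 1.25, highest-calorie product first.
rows = conn.execute(
    "SELECT brand, price, calories FROM product "
    "WHERE price >= ? ORDER BY calories DESC",
    (1.25,),
).fetchall()
conn.close()

for brand, price, calories in rows:
    print(f"{brand:10s} {price:5.2f} {calories:4d}")
```

Brand A is dropped because its price falls below the cutoff, and the two
remaining rows come back with the higher-calorie product first.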


Figure 2-29 Data retrieved from the Wheat database



Unlike importing from a text file, you can have Excel automatically
refresh the data it imports from a database. Thus, if the source database
changes at some point, you can automatically retrieve the new data without
recreating the query. Commands such as refreshing imported data are available
from the Connections group on the Data tab. Table 2-6 describes some of
these commands.


Table 2-6 Commands on the Data tab

Button        Purpose

Refresh All   Refresh all data queries located in the
              current workbook

Connections   View and modify the properties of all the
              data connections in the workbook

Properties    View and modify the properties of the
              currently selected data range


Having seen how one would import data from a database into Excel, you
are ready to save and close the workbook.

To save and close the Wheat workbook:

1 Click the Office button and then click Save As.

2 Type Wheat Database in the File name box, verify that Excel Workbook
(*.xlsx) is displayed in the Save as type box, and then click Save.

3 Click the Office button and then click Close.

4 Click the Office button and then click Exit Excel.


Exercises 

1. Air quality data have been collected from
the Environmental Protection Agency
(EPA) and stored in an Excel workbook.
The workbook displays the number of
unhealthful days (heavy levels of pollution)
per year for 14 major U.S. cities in the year
1980 and then from 2000 through 2006.
Open this workbook and examine the data.

a. Open the Pollution workbook from
the Chapter02 folder and save it as
Pollution Report.

b. Create a new column named
AVER00_06 that uses Excel’s AVERAGE
function to calculate the average
pollution days for each city from 2000
through 2006.

c. Create a new column named DIFF06_80
that calculates the difference
between the average number of pollution
days from 2000 through 2006 and
the number of pollution days in 1980
for each of the 14 cities.

d. Sort the data in ascending order of the
DIFF06_80 column you just created.
Which cities showed an increase in
the number of pollution days? Which
cities showed a decrease? Which city
showed the greatest improvement in
terms of the decline in the number of
unhealthy days?

e. Create a new column that calculates
the ratio of unhealthy days between
the average from 2000 through 2006
and the value for 1980. Name the
column RATIO06_80.

f. Format the values in the RATIO06_80
column as a percentage to two
decimal places.

g. Sort the data in ascending order of
the RATIO06_80 column. Which city
showed the greatest improvement in
terms of the ratio of the 2000-2006
average to the 1980 value?

h. Create range names for all of the
columns in your workbook.

i. Save your workbook and write a
report summarizing your observations.
Does the data prove any conclusions
you might have reached? What kind
of information might be missing from
this data set? Remember that you only
have one year’s worth of data from the
1980s versus seven years’ data from
2000 to 2006. In what way could the
average value from those seven years
not be comparable to a single year’s
value from 1980?

2. Data on soft drink sales shown in
Table 2-7 have been saved in a text file.
The file has five variables and ten cases.
The first variable is the name of the soft
drink brand; the next three variables are
company sales in millions of 192-ounce
cases for the years 2000, 2001, and
2002. (Source: http://www.bevnet.com/
news/2002/03-01-2002-softdrink.asp,
Beverage Marketing Corporation.) The
final column indicates the year of origin
for each brand.

Table 2-7 Soft Drink Sales Data

Brand          Cases2000  Cases2001  Cases2002  Origin

Coca-Cola         3198.0     3189.6     3288.9    1886
Pepsi             2188.0     2163.9     2156.4    1898
Mountain Dew       810.3      853.7      862.7    1946
Dr Pepper          747.4      740.0      737.4    1885
Sprite             713.9      703.3      687.9    1961
Gatorade           355.8      375.0      422.8    1965
7 Up               276.0      261.6      243.4    1929
Tropicana          301.2      307.7      292.9    1954
Minute Maid        218.0      226.5      285.3    1946
Aquafina           105.0      151.4      203.0    1994


a. Import the Drinks.txt file from the
Chapter02 data folder into an Excel
workbook (note that columns are
delimited by tabs).

b. Create range names for each of the
five data columns in the workbook.

c. Create two new columns displaying
the change in sales from 2000 to 2002
and the ratio of the 2000 sales to the
2002 sales. Assign range names to
these two new columns.
Sort the list in descending order of
the difference in sales.

d. Is there any relationship between the
year in which the brand was founded
and the change in sales? (Hint: Are
the older brands showing less growth
than the new brands?)

e. Repeat your analysis using the ratio of
sales.

f. Save the workbook in Excel format
to the Chapter02 folder under the
name Soft Drinks Sales Report and
write a report summarizing your
observations.

3. The NCAA requires schools to submit
information on graduation rates for its
student athletes. Table 2-8 shows the
data for the 11 schools in the Big Ten,
covering the years 1997 through 2000,
indicating the graduation percentage
(within six years). The overall graduation
percentage for all undergraduates
is shown in the Graduated column and
then is broken down by race and gender
in the remaining four columns of the
table for those who received athletic
scholarships.

Table 2-8 Big Ten Graduation Data

                        White  Black  White    Black
University  Graduated   Males  Males  Females  Females

ILL            81        70     52      77       83
IND            72        61     45      76       82
IOWA           66        61     51      81       50
MICH           86        79     44      88       67
MSU            72        61     33      87       63
MINN           58        63     39      70       56
NU             93        87     79      94      100
OSU            66        60     42      77       83
PSU            84        76     69      91       93
PU             67        66     48      84       80
WIS            77        65     50      79       64


a. Enter the data from Table 2-8 into a
blank workbook and save the workbook
as Big Ten Graduation to the
Chapter02 folder.

b. Create two new columns displaying
the difference between white male
and white female graduation rates
and the ratio between white male and
white female graduation rates. How
do graduation rates compare?

c. Create two more columns calculating
the difference and ratio of the white
female graduation rate to the overall
rate from 1997 to 2000.

d. What do you observe from the data?
Does one university stand out from
the others?

e. Sort the data in descending order
of the ratio of white male to white
female graduation rate. Create range
names for all of the columns in the
workbook.

f. Save your changes to the workbook
and write a summary of your
observations.

4. Over 4,000 television viewers were
interviewed in 1984 to determine which
television ads were remembered for being
significant and interesting. The level of
retained impressions was then compared
to the advertising budgets from each firm.
(Source: Wall Street Journal, 1984.)

a. Open the TV Ads workbook from the
Chapter02 folder and save it as TV
Ads Analysis.

b. Calculate the ratio of the retained
impressions per week to the advertising
budget.

c. Create range names for the three
columns in the worksheet.

d. Sort the list in descending order of
ratio values. Which firm showed the
greatest bang for the buck from their
advertising dollars? Print the sorted
data values.

e. Filter the data list, showing only
those firms with a higher-than-average
ratio of retained impressions to
advertising dollars. Print the filtered values.

f. Save your changes to your workbook
and then write a report summarizing
your observations.

5. The Teacher.txt file contains the average
public teacher pay and spending
on public schools per pupil in 1985 for
50 states and the District of Columbia
as reported by the National Education
Association.

a. Open the Teacher.txt file from the
Chapter02 data folder as a tab-delimited
text file. There are four
columns in the text file. The State
column contains the abbreviations of the
50 states and the District of Columbia.
The Pay column contains the average
annual salary of public school teachers
in each state and district. The Spend
column contains the public school
spending per pupil for each state and
district. The Area column contains the
area of the country for each state or
district. Import all of these columns
except the Area column.

b. Create a new column calculating the
ratio of the Pay column to the Spend
column.

c. Create range names for all of the
columns in the worksheet.

d. Sort the data in ascending order of the
Ratio column.

e. Filter the data values, showing only
those states or districts that have a
ratio value of less than 6.

f. Save your workbook as Teacher
Salary Analysis to the Chapter02
folder in Excel workbook format and
summarize your observations.

6. Working in groups in a high school
chemistry lab, students measured the
mass (grams) and volume (cubic
centimeters) of eight aluminum chunks.
Both the mass in grams and the volume
in cubic centimeters were measured
for each chunk. Analyze the data from
the lab.

a. Open the Aluminum workbook from
the Chapter02 folder and save it as
Aluminum Density Analysis.

b. Create a new column in the worksheet,
computing the density of each
chunk (the ratio of mass to volume).
Apply a range name to the new
column.

c. Sort the data from the chunk with
the highest density to that with the
lowest.

d. Calculate the average density for all
chunks.

e. Is there an extreme value (an observation
that stands out as being different
from the others)? Calculate the average
density for all chunks aside from
the outlier. Print your results.

f. Which of the two averages gives the
best approximation of the density of
aluminum? Why?

g. Save your changes to the workbook
and summarize your results.

7. The Economy workbook has seven variables
related to the US economy from
1947 to 1962.

a. Open the Economy workbook from
the Chapter02 data folder. The
Deflator variable is a measure of the
inflation of the dollar, arbitrarily set
to 100 for 1954. The GNP column
contains the Gross National Product
for each year (in millions). The
UnEmploy column contains the number
unemployed in thousands, and
the Arm Force column has the number
in the armed forces in thousands.
The Population column contains the
population in thousands. The Total
Emp column contains the total employment
in thousands. Save the workbook as
Economy Data.

b. Create range names for each column
in the worksheet.

c. Notice that values in the Population
column increase each year. Use the
Sort command to find out for which
other columns this is true.

d. There is an upward trend to the GNP,
although it does not increase each
year. Create a new column that calculates
the GNP per person for each
year. Name this new column GNPPOP
and create a range name for the values
it contains.

e. Save your changes to the workbook
and write a report summarizing your
observations.

8. An analyst has collected 2007 data including
salary and batting average for
major league players. Examine the data
that have been collected.

a. Open the Baseball workbook from the
Chapter02 folder and save it as
Baseball Salary Analysis.

b. Create range names for all of the
columns in the workbook.

c. Sort the data values in descending
order of batting average.

d. Display only those players whose
career batting average is 0.310 or
greater. List these players.

e. Remove the filter, displaying all
values in the workbook again.

f. Add a new column to the worksheet,
displaying the batting average divided
by the player’s salary and multiplied
by 1,000,000.

g. Sort the worksheet in descending
order of the new column you created.
Who are the top-ten players in terms
of batting average per dollar? Print
your results.

h. Examine the number of years played
in your sorted list. Where does it appear
that most of the first-year players
lie? What would account for that?
(Hint: What are some of the other factors
besides batting average that may
account for a player’s high salary?)

i. Save your workbook and write a
report summarizing your observations.

9. An analyst has collected data on the
death rates for diabetes and
influenza-pneumonia for the year 2003. The data
have been saved in the Health workbook.
Your job is to examine the data
values from the workbook.

a. Open the Health workbook from the
Chapter02 folder and save it as Health
Report Analysis.

b. Sort the data by ascending order of
diabetes-related deaths. Then do a
sort on influenza-pneumonia related
deaths. Which states have the highest
and lowest values in those categories?

c. Use the AutoFilter to list the top-ten
states in each category. Print your
filtered worksheet.

d. Turn off the filter and create a new
column calculating the ratio of the
diabetes-related death rate to the
pneumonia-related death rate in each
of the 50 states. Create a range name
for the new column.

e. Format the ratio values in the new
column as percentages to two decimal
places.

f. Sort the data in ascending order of the
new column. Which state or region
has the highest ratio of diabetes-related
deaths? African-Americans
have a much higher rate of diabetes
than whites. Discuss how this explains
your observation of the state
or region with the highest rate of
diabetes-related death.

g. Create range names for all of the
columns in your workbook.

h. Save your changes to the workbook
and write a summary of your
observations.

10. The Cars workbook contains data from
ConsumerReports.org®, February 1,
2008, on 275 different car models, as
described in the following table:

Table 2-9 Car Data

Field        Description

Model ID     Number from 1 to 275
Model        Make and model
Type         Type of vehicle
Price        Price in dollars
HP           Horsepower
Eng size     Engine size in liters
Cyl          Number of cylinders
Eng Type     Type of engine
MPG          Overall miles per gallon
Time0-60     0-60 time in seconds
Weight       Weight in pounds
Date         Date of issue
Region       Original location of manufacturer
Eng type01   1 if hybrid or diesel, 0 otherwise


(Source: Copyright 2008 by Consumers Union of U.S., Inc. Yonkers, NY 10703-1057, a nonprofit organization.
Reprinted with permission from the February 2008 posting of ConsumerReports.org®
<http://www.consumerreports.org/> for educational purposes only. No commercial use or reproduction permitted.)


a. Open the Cars workbook from the
Chapter02 folder.

b. Create range names for all of the data
columns you retrieved.

Using techniques that you’ve learned
in this chapter, answer the following
questions:

c. How many cars come from the model
year 2007?

d. Which car has the highest horsepower?

e. Which car has the highest horsepower
relative to its weight?

f. Which car has the highest miles per
gallon (MPG)?

g. Which car has the highest MPG relative
to its 0-60 time? Note that this
car will have fast acceleration and
pretty good MPG.

h. If you sort the data by the ratio of
MPG to 0-60 time, what is the Region
of most of the cars with low values?
What kind of vehicles are these? What
are the Regions of most of the highest
models, and what kind of vehicles do
you find high on this scale? Discuss
the relationship of this scale to Weight.

i. Save your workbook as Cars
Performance Analysis.
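
Several of the exercises above ask for a difference column, a ratio column,
and a sort on one of them. As a hedged illustration of that pattern outside
Excel, the sketch below applies it to the sales figures from Table 2-7. It
is only one possible arrangement: the variable names are made up for this
example, and the ratio is taken as 2000 sales divided by 2002 sales to follow
the wording of Exercise 2.

```python
# Sales in millions of 192-ounce cases, copied from Table 2-7.
# Each entry: brand -> (Cases2000, Cases2002, Origin).
drinks = {
    "Coca-Cola": (3198.0, 3288.9, 1886),
    "Pepsi": (2188.0, 2156.4, 1898),
    "Mountain Dew": (810.3, 862.7, 1946),
    "Dr Pepper": (747.4, 737.4, 1885),
    "Sprite": (713.9, 687.9, 1961),
    "Gatorade": (355.8, 422.8, 1965),
    "7 Up": (276.0, 243.4, 1929),
    "Tropicana": (301.2, 292.9, 1954),
    "Minute Maid": (218.0, 285.3, 1946),
    "Aquafina": (105.0, 203.0, 1994),
}

# One row per brand: (brand, change, ratio, origin).
# change = 2002 sales minus 2000 sales; ratio = 2000 sales / 2002 sales.
rows = [
    (brand, c2002 - c2000, c2000 / c2002, origin)
    for brand, (c2000, c2002, origin) in drinks.items()
]

# Sort in descending order of the change column.
by_change = sorted(rows, key=lambda row: row[1], reverse=True)

for brand, change, ratio, origin in by_change:
    print(f"{brand:13s} {change:8.1f} {ratio:6.3f} {origin}")
```

Sorting on `row[2]` instead of `row[1]` orders the same rows by the ratio
column.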
Chapter 3


Working with Charts

Objectives

In this chapter you will learn to:

• Identify the different types of charts created by Excel
• Create a scatter plot with the Chart Wizard
• Edit the appearance of your chart
• Label points on your scatter plot
• Break a scatter plot down by categories
• Create a bubble plot
• Create a scatter plot containing several data series

In Chapter 2, you learned how to work with data in an Excel worksheet.
In this chapter you’ll learn how to display those data with charts. This
chapter focuses primarily on two types of charts: scatter plots and
bubble charts. Both are important tools in the field of statistics. You’ll also
learn how to use some features of StatPlus that give you additional tools in
working with and interpreting your charts.


Introducing Excel Charts 

A picture is worth a thousand words. Properly designed and presented, a
graph can be worth a thousand words of description. Concepts difficult to
describe through a recitation of numbers can be easily displayed in a chart 
or plot. Charts can quickly show general trends, unusual observations, and 
important relationships between variables. In Table 3-1, a table of monthly 
sales values is displayed. How do sales vary during the year? Which month 
in the table displays an unusual sales result? Can you easily tell? 


Table 3-1 Monthly Sales Values

Date         Sales

Jan. 2010    $16,800
Feb. 2010    $19,300
Mar. 2010    $21,100
Apr. 2010    $21,200
May 2010     $20,700
Jun. 2010    $19,200
Jul. 2010    $16,100
Aug. 2010    $14,900
Sep. 2010    $12,100
Oct. 2010    $11,900
Nov. 2010    $12,500
Dec. 2010    $14,300
Jan. 2011    $17,500
Feb. 2011    $19,600
Mar. 2011    $20,900
Apr. 2011    $18,200
May 2011     $20,600
Jun. 2011    $18,800
Jul. 2011    $17,100
Aug. 2011    $14,100

It’s difficult to answer those questions by examining the table. Now let’s
plot those values in Figure 3-1.



The chart clarifies things for us. We notice immediately that the sales
figures seem to follow a classic seasonal curve with the highest sales
occurring during the late winter and early spring months. However, the sales
figures for April 2011 seem to be too low. Perhaps something occurred during this
time period that should be investigated, or perhaps an erroneous value was
entered. In any case, the chart has provided insights that would have been
difficult to immediately grasp from a table of values alone.
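
A quick numeric check can back up what the chart suggests. The sketch below
is only one simple way to do this and is not a method from this chapter: it
assumes that sales in the same month of consecutive years should be roughly
comparable, and it flags the month with the largest year-over-year change.
The figures are copied from Table 3-1; the variable names are made up for
the example.

```python
# Monthly sales in dollars, copied from Table 3-1.
sales_2010 = {"Jan": 16800, "Feb": 19300, "Mar": 21100, "Apr": 21200,
              "May": 20700, "Jun": 19200, "Jul": 16100, "Aug": 14900,
              "Sep": 12100, "Oct": 11900, "Nov": 12500, "Dec": 14300}
sales_2011 = {"Jan": 17500, "Feb": 19600, "Mar": 20900, "Apr": 18200,
              "May": 20600, "Jun": 18800, "Jul": 17100, "Aug": 14100}

# Year-over-year change for each month that appears in both years.
change = {m: sales_2011[m] - sales_2010[m] for m in sales_2011}

# The month with the largest absolute change is the candidate to investigate.
odd_month = max(change, key=lambda m: abs(change[m]))

for m, diff in change.items():
    print(f"{m} {diff:+7d}")
print("Largest year-over-year change:", odd_month, change[odd_month])
```

April stands out with a drop of $3,000, while every other month changes by
$1,000 or less, which agrees with the impression given by the chart.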

Excel supports several different chart types for different situations. 
Table 3-2 shows a partial list of these. 

Table 3-2 Excel Chart Types

Name             Description

Area             An area chart displays the magnitude of change over
                 time or between categories. You can also display the
                 sum of group values, showing the relationship of each
                 part to the whole.

Column           A column chart shows how data change over time or
                 between categories. Values are displayed vertically,
                 categories horizontally.

Bar              A bar chart shows how data change over time or
                 between categories. Values are displayed horizontally,
                 categories vertically.

Line             A line chart shows trends in data, spaced at equal
                 intervals. It can also be used to compare values
                 between groups.

Pie              A pie chart shows the proportional size of items that
                 make up the whole. The chart is limited to one data
                 series.

Doughnut         A doughnut chart, like the pie chart, shows the
                 proportional size of items relative to the whole; it can
                 also display more than one data series at a time.

Stock            The stock chart is used to display stock market data,
                 including opening, closing, low, and high daily
                 values.

XY (Scatter)     An XY (scatter) chart displays the relationship between
                 numeric values in several data series. The chart is
                 commonly used for scientific data and is also known
                 as a scatter plot.

Bubble           A bubble chart is a type of scatter plot in which the size
                 of the bubbles is proportional to the value of a third
                 data series.

Radar            A radar chart shows values from different categories
                 radiating from a center point. Lines connect the values
                 within each data series.

Surface          A surface chart shows the value of a data series in
                 relation to the combination of the values of two
                 other data series. Surface charts are often used in
                 topographical maps.

Cone, Cylinder,  Cone, cylinder, and pyramid charts are similar to
and Pyramid      bar and column charts except that they use cones,
                 cylinders, and pyramids for the markers.


Excel includes variations of each of these chart types. For example, the 
column charts can display values across categories or the percentage that 
each value contributes to the whole across categories. Many of the charts 
can be displayed in 3D as well. 

Most of the charts you’ll create in this book will be of the XY(scatter) type. 
Other chart types, like stock charts, are designed for specific types of data 
(like stock market data), and they are not as useful for general data analysis. 
In addition, the StatPlus add-in included with this book gives you the
capability of creating other types of charts not part of Excel’s library of built-in
charts. You’ll learn about these charts as you read through these chapters.

Excel charts are placed in workbooks in one of two ways: either as
embedded chart objects, which appear as objects within worksheets, or as
chart sheets, which appear as separate sheets in the workbook. Figure 3-2
shows examples of both ways of displaying a chart.

[Figure 3-2 callout: chart appears as a separate sheet in the workbook]

Introducing Scatter Plots

In this chapter, we’ll examine athletic graduation rates for a group of
universities. The Big Ten workbook contains information on graduation rates
of student athletes (with athletic scholarships) who enrolled as freshmen at
Big Ten universities in 1997, 1998, 1999, and 2000. Each NCAA Division I
college or university is required to distribute this information to prospective
student-athletes and parents, so that potential recruits have a way of
comparing the education environment between different universities. Table 3-3
describes the range names used in the workbook.


Table 3-3 Big Ten Graduation Rates

Range Name      Range    Description

University      A2:A12   Name of the university
SAT             B2:B12   Average SAT scores of all freshmen
ACT             C2:C12   Average ACT scores of all freshmen
SAT_Calc        D2:D12   SAT calculated from ACT based on a formula from
                         the College Board
Graduated       E2:E12   Percentage of all freshmen graduating within 6 years
                         of enrolling
White_Males     F2:F12   Six-year graduation rates for white male athletes
Black_Males     G2:G12   Six-year graduation rates for black male athletes
White_Females   H2:H12   Six-year graduation rates for white female athletes
Black_Females   I2:I12   Six-year graduation rates for black female athletes
Enrollment      J2:J12   Total enrollment at the university
Top_25          K2:K12   Percentage of incoming students graduating in the top
                         25% of their high school class
Top_25_Rate     L2:L12   Indicates whether more than 80% or less than 80% of
                         the incoming freshmen graduated in the upper quarter of
                         their class


To open the Big Ten workbook:

1 Start Excel and open the Big Ten workbook from the Chapter03
folder.

The workbook opens to a sheet displaying graduation data from the
11 universities in the Big Ten. See Figure 3-3. Range names based
on the column labels of each column have already been created for
you. There are some missing values in the worksheet, such as the
SAT value for the University of Iowa in row 4. However, for
universities that do not collect SATs, a calculated estimate of the SAT is
displayed in column D.

Figure 3-3 The Big Ten workbook

2 Save the file as Big Ten Graduation Chart.


A school with a low graduation rate among its stndent-athletes is vnlner- 
ahle to investigation and possible sanctions on the part of the NCAA. We’re 
going to explore the relationship between the average SAT score from classes 
of incoming first-year stndents and the percentage of those stndents in those 
classes who eventnally gradnate within six years of entering college. 

One qnestion we might ask is: Do incoming classes with high average 
SAT scores have higher rates of gradnation? Is this trne for all nniversities? 
We’ll get a visnal pictnre of this relationship by prodncing a scatter plot. 

A scatter plot is a chart in which observations are represented by points 
on a rectangnlar coordinate system. Each observation consists of two valnes: 
One valne is plotted against the vertical or y axis, and the second valne is 
plotted against the horizontal or x axis (see Fignre 3-4). In Fignre 3-4 we are 
plotting a point with x = 3 and y = 5. 
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y axis 


Figure 3-4 
The 

rectangular 
coordinate 
system of a 
scatter plot 


10 

Scatterplot 


« 



4 

1 



i 

» 1 4 • 

■ 

1 ir 


point corresponds to an 
observation with the pair 
of values (3, 5) 

X axis 
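
To make the coordinate idea concrete, here is a minimal sketch that draws a
text-only scatter plot. It is an illustration rather than anything Excel
does internally: the function name ascii_scatter and its parameters are
hypothetical, and the points are assumed to be whole numbers that fit inside
the grid.

```python
def ascii_scatter(points, x_max=10, y_max=10):
    """Return a text grid with '*' at each (x, y) pair.

    x increases to the right and y increases upward, as on a scatter plot.
    Points are assumed to be integers within 0..x_max and 0..y_max.
    """
    grid = [["."] * (x_max + 1) for _ in range(y_max + 1)]
    for x, y in points:
        # Row 0 is printed first, so the top row holds y = y_max.
        grid[y_max - y][x] = "*"
    return "\n".join("".join(row) for row in grid)


# The observation from Figure 3-4: x = 3, y = 5.
plot = ascii_scatter([(3, 5)])
print(plot)
```

Passing several pairs, for example `ascii_scatter([(1, 9), (4, 6), (8, 2)])`,
marks one point per observation on the same grid.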


By observing the placement of the points on the scatter plot, you can get
a general impression of the relationship between the two sets of values. For
example, the scatter plot of Figure 3-5 shows that high values of Variable 1
are associated with low values of Variable 2. This is not a perfect
association; rather, there is some scatter in the points.

Figure 3-5 Scatter plot displays a general relation between two sets of values
(callout: high y values are associated with low x values)

The scatter plot you’ll create will have the graduation rate for each
university on the y axis and the university’s average SAT score on the x axis.
To create a chart, you insert a chart object on the worksheet using the Excel
Insert tab.

To insert a chart:

1 Select the cell range D1:E12.

2 Click the Scatter button from the Charts group on the Insert tab
and then click the Scatter with only markers option as shown in
Figure 3-6.

Figure 3-6 Inserting a scatter plot
(callout: Insert tab)

Excel inserts an embedded chart object containing the scatter
plot of the Graduated values versus the SAT calculated values
(see Figure 3-7).

Figure 3-7 Embedded scatter plot
(callouts: Chart Tools ribbon appears when chart is selected; selected plot)

Once you’ve created the basic scatter plot, you can format its appearance
using the commands on the Chart Tools ribbon. Note that the Chart Tools
ribbon is a contextual ribbon and will appear within the Excel window
whenever a chart object is selected in the workbook. Newly created
charts are selected by default.


EXCEL TIPS

• When you insert a scatter plot using Excel, the column of data
values to the left will be used for the x values, and the column on the
right will be used for the y values. If your columns are not laid
out this way, then do the following:

a. Generate the scatter plot with the two columns as they are
currently laid out.

b. Click the Select Data button from the Data group on the
Design tab of the Chart Tools ribbon to open the Select Data
Source dialog box.

c. Click the Edit button within the dialog box, and then select
a different data range for the Series X values and the Series
Y values and the Series name.

If you don’t want to manually edit your scatter plot this way,
you can also create scatter plots using StatPlus. Simply click the
StatPlus menu, click Single Variable Charts, and then click Fast
Scatterplot. From the dialog box that appears, you can select the
x axis and y axis values without regard to their column order.

• Excel supports five built-in scatter plots, allowing the user to
connect the scatter plot points with straight or smoothed lines.

Editing a Chart

Using Excel’s editing tools, you can modify the symbols, fonts, colors, and
borders used in your chart. You can change the scale of the horizontal and
vertical axes and insert additional data into the chart. To start, you’ll edit the
size and location of the embedded chart object you just created with Excel.


Resizing and Moving an Embedded Chart

Newly created charts are inserted as embedded objects with selection
handles around the chart. When a chart is selected, you can move and resize it
on the worksheet. The chart you’ve just created covers some of the data on
the Grad Percents worksheet. Move it to a different location.

To move the embedded chart:

1 Click an empty area within the embedded chart, either above or to
the right of the chart area, and hold down the mouse button. As you
press the mouse button down, the pointer changes to a four-headed arrow.

Note: If you click the title or other chart element, that element will
have a selection border around it. If this happens, click elsewhere
within the chart, holding down the mouse button. You don’t want to
select individual chart elements yet.

2 With the mouse button still pressed down, move the chart down so
that the upper-left corner of the chart covers cell B14 and release the
mouse button (see Figure 3-8).

Figure 3-8 Moving an embedded chart

Around the selected chart object at the four corners and four sides are
selection handles that you can use to resize the embedded chart. As you
move the pointer arrow over the handles, you’ll see the pointer change to
a double-headed arrow of various orientations. Each pointer allows you
to resize the chart in the direction indicated. Try using the selection handles
to make the chart a little larger.

To enlarge the chart:

1 Move your mouse pointer over the handle on the right edge of the
chart until the pointer changes to a horizontal double-headed arrow.

2 Drag the pointer to the right so that the embedded chart covers
column I.

3 Move the pointer to the bottom edge of the chart object and drag the
selection handle so that the border edge covers row 35 (see Figure 3-9).

Figure 3-9 Resizing an embedded chart
(callout: selection handle)

Moving a Chart to a Chart Sheet 


To move an embedded chart to its own chart sheet, you can use the Move 
Chart button from the Chart Tools ribbon. Try this now by moving the 
embedded chart you just created to a chart sheet. 


To move an embedded chart to a chart sheet:

1 With the embedded chart still selected, click the Move Chart button
located in the Location group of the Design tab of the Chart Tools ribbon.

2 Excel opens the Move Chart dialog box. Click the New sheet option
button and type Graduation Chart in the accompanying text box as
shown in Figure 3-10.

Figure 3-10 Move Chart dialog box

3 Click the OK button.

As shown in Figure 3-11, the chart is moved to a chart sheet named
Graduation Chart.

Figure 3-11 Newly inserted chart sheet
(callouts: Move Chart button; Graduation Chart sheet name)

Working with Chart and Axis Titles 

To make your charts easier to interpret, you should add titles to both axes
and over the entire chart area. By default, Excel will display the name used
for the y axis data values as the chart title. In this case the chart title is
Graduated since the y axis values come from the Graduated column, which
is column E on the Grad Percents worksheet.

To insert different titles, use the Chart Title and Axis Title buttons from
the Chart Tools ribbon or you can select the titles from the chart and type
over the current titles. Try this by changing the chart title to Big Ten
Graduation Percentages.


To change the chart title: 

1 Click the Graduated title located directly above the scatter plot. 

Selection handles appear around the chart title indicating that it is
selected.

2 Type Big Ten Graduation Percentages and press Enter. As shown in
Figure 3-12, the title changes to reflect the new text.


Figure 3-12
Changing the
chart title

chart title with selection handles


Next, add titles to both the y axis and x axis. Since these titles are not 
currently displayed on the chart you have to add them with the Axis Titles 
button. 


To insert the axis titles: 

1 Click the Axis Titles button from the Labels group on the Layout tab 
of the Chart Tools ribbon, click Primary Horizontal Axis Title, and 
then click Title Below Axis. 

Excel inserts the text Axis Title below the horizontal, or x, axis. The 
Axis Title text is surrounded with selection boxes indicating that it 
is the currently selected object in the chart. 

2 Type Calculated SAT Values and press Enter. 

3 Click the Axis Titles button again, click Primary Vertical Axis Title, 
and then click Rotated Title. 

4 Type Graduation Percentage and press Enter. Figure 3-13 shows the
newly added axis titles.

Chart Title and Axis Titles



The chart and axis titles you’ve entered can be formatted using the same
formatting buttons found on the Home tab that you use for formatting cell
text. Try using these buttons now to increase the font size of the two axis
titles.

To format the axis titles: 

1 Click the Home tab with the Graduation Percentage title still
selected; then click the Font Size button, changing the font
size to 16.

2 Click the Calculated SAT Values x axis title and change the font size
to 14 using the same technique.

EXCEL TIPS

• You can also use the format buttons on the Home tab to change
the font color, font style, alignment, and fill color of chart and
axis titles.

• You can further format chart and axis titles by selecting the title
and then clicking the Format Selection button from the Current
Selection group on the Layout tab of the Chart Tools ribbon. The
button opens a Format Axis Title dialog box containing several
formatting options.


Editing the Chart Axes 

Next you’ll look at editing the values displayed in the chart. Even though
all of the Big Ten graduation rates are 50% or greater in this chart, Excel
uses a range of 0 to 100%; and even though the lowest SAT score is 1105,
Excel uses a lower range of 0 in the chart. The effect of this is that all of the
data are clustered in the upper right edge of the chart, leaving a large blank
space to the left and below. There are some situations where you want your
charts to show the complete range of possible values and other situations
where you want to concentrate on the range of observed values. In the latter
case, you rescale the axes so that the scales more closely match the range of the
observed values. You can change the scale of the axes by clicking the Axes
button on the Chart Tools ribbon. Start with changing the scale on the x axis
to better match the range of calculated SAT scores for the 11 schools in the
data sample.

To change the scale of the x axis: 

1 Click the Axes button from the Axes group on the Layout tab of the 
Chart Tools ribbon, click Primary Horizontal Axis and then click 
More Primary Horizontal Axis Options. Excel opens the Format 
Axis dialog box. 

2 Verify that the Axis Options tab is selected. On this tab the options
for the axis scale are shown. By default, Excel automatically selects
the minimum and maximum range of the axis values. You want to
change the minimum from 0 to 1000, set the maximum at 1500, and
set the interval between tick marks on the axis to 50 units. To make
this change, do the following steps:

3 Click the Fixed option button for the Axis minimum and enter 1000
in the accompanying text box.

4 Click the Fixed option button for the Axis maximum and enter 1500
in the accompanying text box.

5 Click the Fixed option button for the Axis major unit and enter 50
in the accompanying text box. Figure 3-14 shows the completed
Format Axis dialog box.


Figure 3-14 
Setting the 
axis scale 



axis scale goes 
from 1000 to 1500 
in steps of 50 


6 Click the Close button. Figure 3-15 shows the appearance of the
chart with the revised scale for the x axis.

Figure 3-15
Revised scale
for the x axis

axis scale goes from 1000 to 1500 in steps of 50


Now the data points are only compressed near the top of the chart. You
can change this by setting the scale of the graduation percentages on the
y axis to go from 50 to 100% in steps of 5%.

To change the scale of the y axis:

1 Click the Axes button from the Axes group on the Layout tab of the
Chart Tools ribbon, click Primary Vertical Axis and then click More
Primary Vertical Axis Options. Excel opens the Format Axis dialog
box for the vertical axis.

2 Click the Fixed option buttons for the Axis minimum, maximum,
and major units and change the minimum value to 50, the maximum
value to 100, and the major unit to 5.

3 Click the Close button. Figure 3-16 shows the revised scale for the
vertical axis on the chart.

Figure 3-16
Revised scale
for the y axis


axis scale goes
from 50 to 100
in steps of 5

EXCEL TIPS

• You can display the axis scale in units of thousands, millions,
and billions by selecting the appropriate options from the Axes
button.

• You can display values on a log scale by selecting the Show Axis
with Log Scale option on the Axes button.
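
The fixed axis scales chosen in this section can also be written out as explicit lists of tick values. The following is a minimal Python sketch and is not part of Excel or StatPlus; the helper name axis_ticks is hypothetical. It simply lists the values at which tick marks fall for a given minimum, maximum, and major unit.

```python
def axis_ticks(minimum, maximum, major_unit):
    """Return the tick values for a fixed axis scale (hypothetical helper)."""
    if major_unit <= 0 or maximum < minimum:
        raise ValueError("need major_unit > 0 and maximum >= minimum")
    ticks = []
    value = minimum
    while value <= maximum:
        ticks.append(value)
        value += major_unit
    return ticks

# Scales used in this section of the text.
x_ticks = axis_ticks(1000, 1500, 50)  # calculated SAT values
y_ticks = axis_ticks(50, 100, 5)      # graduation percentages

print(x_ticks)
print(y_ticks)
```

Both scales produce the same number of tick marks, and the lowest observed SAT score mentioned in the text (1105) falls inside the x axis range.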


Working with Gridlines and Legends 

Another chart object that appears in the graduation scatter plot is gridlines.
Gridlines are vertical or horizontal lines that match up with the major and
minor units on the x and y axes. Gridlines can make it easier to line up data
values within the scatter plot. By default, Excel will open a scatter plot with
horizontal gridlines matching the major units on the y axis. You can add or
remove major and minor gridlines from scatter plots using the commands
on the Chart Tools ribbon. Try this by adding vertical gridlines to the
graduation percentage scatter plot.

To add vertical gridlines:


1 With the chart still selected, click the Gridlines button from the Axes
group on the Layout tab of the Chart Tools ribbon, click Primary
Vertical Gridlines and then click Major Gridlines.

As shown in Figure 3-17, vertical gridlines are added to the scatter plot. 


Figure 3-17
Adding vertical
gridlines

You can edit the format of the gridlines by clicking the More Primary
Gridlines Options command found on the menu of commands for each gridline.
By modifying the format, you can change the gridline’s color and style as
well as add drop shadows to each gridline.

The graduation percentage scatter plot also contains a legend. A legend
is a box that identifies the patterns or colors that are assigned to the data
points in a chart. When you insert a chart, Excel automatically adds a
legend. In the graduation percentage chart, the legend appears on the right edge
of the chart, providing the name of the y values (in this case values from the
Graduated column). If there is only one set of data values in the chart you
usually do not need a legend.

To remove the legend:

1 Click the Legend button from the Labels group in the Layout tab of
the Chart Tools ribbon and then click None.

Excel removes the legend from the scatter plot.


From the Legend button you can also choose commands to move the 
legend to different locations relative to the chart area and to format the 
legend’s appearance including its size, fill color, and font styles. 


Editing Plot Symbols 


As with other parts of the chart, Excel allows the user to modify the display
of the plot symbols. By default, Excel uses a blue diamond as the plot
symbol. You’ll change this to an empty circle. There is no button on the Chart
Tools ribbon to modify the appearance of the symbols; instead you must
select the symbols and format them using the Format Selection button on the
Layout tab. Try this by changing the appearance of the scatter plot points to
empty circles.



To change the plot symbol: 

1 Click any of the plot symbols in the scatter plot. Selection handles
will appear around all of the scatter plot data points.

2 Click the Format Selection button from the Current Selection group
on the Layout tab of the Chart Tools ribbon. Excel opens the Format
Data Series dialog box.

The Format Data Series dialog box is divided into different tabs that
allow you to format the appearance of the plot symbols used in the
selected data series.

3 Click the Marker Options tab and then click the Built-in option
button. Click the Type drop-down list box and select the circle symbol.

4 Click the Size list box and increase the circle size to 10. Figure 3-18
shows the selected options from the dialog box.

Figure 3-18
Selecting a
plot symbol



5 Click the Marker Fill tab and then click the No fill option button.

6 Click the Close button to accept the changes to the plot symbols.

7 Click outside of the chart sheet to deselect. Figure 3-19 shows the
revised appearance of the graduation chart.


Figure 3-19
Revised
symbols for the
graduation chart

EXCEL TIPS

• Excel has several different built-in chart layouts for making
quick changes to several chart objects at the same time. You can
view a gallery of the chart layouts by clicking the Chart Layouts
button located on the Design tab of the Chart Tools ribbon.

• You can change the type of chart displayed by Excel by selecting
a chart and then clicking the Change Chart Type button located
in the Type group of the Design tab in the Chart Tools ribbon.
The Change Chart Type dialog box will then display a list of all chart
types and chart templates stored on your computer.

• If you want to reuse all of the formatting choices you made for
your chart in future charts, you can save your chart as a template
by clicking the Save As Template button from the Type group on
the Design tab of the Chart Tools ribbon.


Now that you’ve formatted the chart, interpret the scatter plot you’ve
created. One of the questions you were asking was: What is the relationship
(if any) between the average SAT score of a freshman class and its
eventual graduation rate? You can now put forward one hypothesis:
Higher average SAT scores seem to be associated with higher graduation
rates. That’s not surprising, but there are a couple of exceptions to that
relationship. For example, a freshman class of students at one university
showed an average SAT score of 1140 with a graduation percentage
of 58%, which might be lower than would be expected on the basis of
the graduation rates for the other universities with similar average SAT
scores. Which university is it?


Identifying Data Points 

When you plot data, you often want to be able to identify individual points. 
This is particularly important for values that seem unusual. In those cases, you 
might want to go back to the source of the data and check to see whether there 
were any anomalies in how the data were collected and entered. You may have 
already noticed that if you pass your mouse cursor over the selected data points
in the Big Ten scatter plot, a screen tip appears to identify the data series name
as well as the pair of values used in plotting the point (see Figure 3-20).


Figure 3-20
Screen tip
identifying
a data point

Although this information is interesting and potentially helpful, it doesn’t
tell you more about the source of the data point. For example, which
university supplied this particular combination of SAT score and graduation
percentage? One way to find out is to compare the values given in the
pop-up label with the values in the worksheet. For example, you could return to
the Grad Percents worksheet to see that the point identified in Figure 3-20
is from the University of Minnesota (MINN), whose freshman class had an
SAT average of 1140 and an eventual graduation rate of 58%. In this fashion
you could continue to compare values between the chart and the worksheet,
finding out which university is associated with which data point. Of course,
this is time consuming and impractical, especially for larger data sets. Excel
doesn’t provide any other method of identifying specific points, but the
StatPlus add-in that comes with this book does provide some additional
commands for this purpose (if you haven’t installed StatPlus, please read the
material in Chapter 1 about StatPlus and installing add-ins).
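
Comparing the screen tip values against the worksheet by hand amounts to a simple search through the rows. The following Python sketch illustrates the idea outside of Excel; it is not how StatPlus works internally. Only the MINN row uses values given in the text (SAT 1140, graduation 58%); the other rows and the helper name find_rows are invented placeholders.

```python
# Each row: (university, calculated SAT, graduation percentage).
# Only the MINN row comes from the text; "AAA" and "BBB" are placeholders.
rows = [
    ("AAA", 1200, 70),   # placeholder values
    ("MINN", 1140, 58),  # values quoted in the text
    ("BBB", 1250, 80),   # placeholder values
]

def find_rows(rows, sat, grad):
    """Return (position, name) for every row matching the plotted point.

    Positions are 1-based, counting data rows only.
    """
    return [
        (position, name)
        for position, (name, row_sat, row_grad) in enumerate(rows, start=1)
        if row_sat == sat and row_grad == grad
    ]

matches = find_rows(rows, 1140, 58)
print(matches)
```

The function returns a list rather than a single row because two universities could share the same pair of values, in which case the screen tip alone cannot tell them apart.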


Selecting a Data Row 


One of the StatPlus commands you can use to identify a particular row is
the Select Row command. This command works only if your data values
are organized into columns. To use this command, you select a single point
from the chart and then click Select Row from the StatPlus menu. Try this
now and identify the university that had the lowest graduation percentage
in the Big Ten.



To select a data row:

1 Click a data point in the scatter plot in order to select the entire data
series.

2 Click the plot symbol highlighted in Figure 3-20 where the SAT
value is equal to 1140 and the graduation percentage is equal to 58.
Now only that plot symbol should be selected and none of the other
symbols.

3 Click StatPlus from the Menu Commands group on the Add-Ins
tab.

4 Click Select Row from the StatPlus menu.

The eighth row should now be highlighted, indicating that the
University of Minnesota (MINN) is the university that had the lowest
graduation percentage in the Big Ten (see Figure 3-21).

Figure 3-21
Selecting
the data row
for a plot
point

5 Click cell A1 to remove the highlighting.


Labeling Data Points 

You can also use the StatPlus add-in to attach labels to all of the points in 
the data series. These labels can be linked to text in the worksheet so that if 
the text changes, the labels are automatically updated. Use StatPlus now to 
add the university name to each point in the chart. 

To add labels to the chart: 

1 Return to the Graduation Chart chart sheet and then click outside of 
the chart to deselect it. 

2 Click a plot symbol from the chart again to reselect all of the plot 
symbols. 

3 Click Label series points from the StatPlus menu located in the
Menu Commands group on the Add-Ins tab. Excel opens the Label
Point(s) dialog box.

Most StatPlus commands give you the choice of entering range 
names or range references. Because range names have already been 
created for this workbook, you can select the appropriate range 
name from a list box. In this case, you’ll use the text entered into the 
University column from the worksheet. 

4 Click the Labels button to open the Input Options dialog box and then 
click the Use Range Names option button, if necessary, to select it. 

5 Scroll down the list box and click University and then click the OK 
button. 

6 Click the Link to label cells checkbox.

By linking to the label cells, you ensure that any changes you make
to text in the University column will be automatically reflected in
the labels in the scatter plot.

7 Click the OK button. 

8 Click outside the chart to deselect. Your chart should resemble the 
one shown in Figure 3-22. 


Figure 3-22
Identifying
plot points by
university

STATPLUS TIPS

• If you want to use the same text font and format in the worksheet
and in the labels, click the Copy label cell format checkbox in
the Label Point(s) dialog box.

• If you want to replace the plot symbols with labels, click the
Replace points with labels checkbox in the Label Point(s) dialog
box. Be aware, however, that once you do this, you cannot go
back to displaying the plot symbols.

• To label a single point rather than all of the points in the data
series, select only a single plot symbol and then apply the Label
Point(s) command from the StatPlus menu.

When you label every data point, there is often a problem with
overcrowding. Points that are close together tend to have their labels overlap, as is the
case with the Iowa, Ohio State, and Purdue labels in Figure 3-22. This is not
necessarily bad if you’re interested mainly in points that lie outside the norm.


Formatting Labels 

You’ve learned that the Big Ten university that has a low graduation rate
for its students relative to the average SAT score of its freshman class is
Minnesota. You might wonder why its graduation rate is so much lower
than the rates for the other universities. On the basis of the values in the
chart, you would expect a graduation rate between 65 and 75% for an
average SAT score of around 1140, not one as low as 58%. Perhaps it is because
Minneapolis-St. Paul is the largest city among Big Ten towns, and students
might have more distractions there, or the composition of the student body
might be different. Columbus is the next largest city, and Ohio State is next
to last in graduation rate with 66%, which seems consistent with this hypothesis.
On the other hand, Northwestern is in Evanston, right next door to Chicago,
the biggest midwestern city, so you might expect it to have a low graduation
rate too. However, Northwestern is also a private school with an elite
student body and high admission standards and has a graduation rate of 93%.

Minnesota’s graduation rate still seems curious. You decide to mark this
point for further study by changing the format of its label to boldface red.

To format a label: 

1 Click any label in the chart to select all of the labels in the data 
series. 

Note that selection handles appear around each label. If you wanted to 
format all of the labels simultaneously, you could do this by applying 
any of Excel’s formatting commands to this selected group. To format a 
single label, you have to select it again from the group of labels. 

2 Click the MINN label. 

The selection is now limited to only the Minnesota label.

3 Click the Font Color button from the Font group on the Home
tab and change the font color to red (the second entry in the list of
standard colors).

4 Click the Bold button from the Font group.

5 Click outside the chart to deselect it. The format of the MINN data
label should now be boldface red.

EXCEL TIPS

• You can also select and format your data labels by clicking the
Data Labels button located in the Labels group of the Chart
Layout tab.


Creating Bubble Plots 


Let’s examine another possible impact on the graduation rate. The data set
also includes the percentage of all freshmen who graduated in the top 25%
of their high school class. We could create a scatter plot of the Graduated
column values versus the Top 25% column values. However, it may be more
instructive to include the calculated SAT values in the chart. One way of
observing the relationship among the three variables is through a bubble
plot. A bubble plot is similar to a scatter plot, except that the size of each
point in the plot is proportional to the size of a third value. In this case,
we’ll create a bubble plot of graduation rate versus SAT average, the size
of each plot symbol being determined by the percentage of incoming
freshmen who graduated in the top 25% of their class. Note that we won’t prove
that this value affects the graduation rate; we are merely exploring whether
there is graphical evidence to suggest such a relationship. Bubble plots are
another chart type supported by Excel and can be easily created using the
Insert tab.


To insert a bubble plot:

1 Return to the Grad Percents worksheet and select the nonadjacent
cell range D1:E12;K1:K12.

The order of the columns is important in a bubble plot. The values
for the x axis should be listed first, then the values for the y axis,
followed by the values that determine the size of the plot bubbles.

2 Click the Insert tab and then click the Other Charts button from the
Charts group on the ribbon; then as shown in Figure 3-23, select the
first Bubble chart option from the menu.

Figure 3-23
Identifying
plot points by
university



Excel inserts the unformatted bubble plot as an embedded chart on
the Grad Percents worksheet (see Figure 3-24).
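
The column-order requirement for a bubble plot can be pictured as building one (x, y, size) record per row. The following Python sketch is only an illustration of that ordering, not a description of Excel's internals; the helper name bubble_records and all data values except Minnesota's SAT of 1140 and graduation rate of 58 are made-up placeholders.

```python
def bubble_records(x_values, y_values, sizes):
    """Combine three equal-length columns into (x, y, size) records."""
    if not (len(x_values) == len(y_values) == len(sizes)):
        raise ValueError("all three columns must have the same number of rows")
    return list(zip(x_values, y_values, sizes))

# Columns in the order a bubble plot expects them.
sat = [1140, 1200]      # x axis values, listed first (1200 is a placeholder)
graduated = [58, 70]    # y axis values, listed second (70 is a placeholder)
top_25 = [60, 75]       # bubble sizes, listed last (both are placeholders)

records = bubble_records(sat, graduated, top_25)
print(records)
```

Passing the columns in a different order would still produce records, but the chart would then plot the wrong variable on each axis, which is why the selection order matters.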


Figure 3-24
The initial
bubble
plot as an
embedded
chart object

As you did earlier in creating the scatter plot, you now have to format the
appearance of the chart to make it easier to interpret and understand. You’ll
move the chart to its own chart sheet, add titles, and change the axis scales.

To format the bubble plot:

1 With the chart still selected, click the Move Chart button located in
the Location group of the Design tab under the Chart Tools ribbon.

2 Click the New Sheet option button in the Move Chart dialog box,
enter Bubble Chart in the New Sheet text box and then click the OK
button.

3 Click the chart title and change it from Graduated to Graduation
Rates for Different Top 25 Percent Rates.

4 Click the Legend button from the Labels group on the Layout tab
of the Chart Tools ribbon and then click None to remove the chart
legend.

5 Click the Axis Title button from the Labels group, click Primary
Horizontal Axis and then click Title Below Axis. Type Calculated
SAT Values for the horizontal axis title. Set the font size to
14 points.

6 Click the Axis Title button again from the Labels group, click
Primary Vertical Axis, and then click Rotated Title. Type Graduation
Percentages for the vertical axis title. Set the font size to 16 points.

Now change the scale of the horizontal axis to go from 1000 to 1500
in intervals of 50 points.

7 Click the Axes button from the Axes group on the Layout tab of the
Chart Tools ribbon and then click More Primary Horizontal Axis
Options.

8 Within the Axis Options tab, set the Minimum fixed value to 1000,
the Maximum fixed value to 1500, and the Major Unit value to 50.
Click the Close button.

Change the scale of the vertical axis to range from 50 to 100 in steps
of 5.

9 Click the Axes button and then click More Primary Vertical Axis
Options. Within the Axis Options tab of the Format Axis dialog box, set
the Minimum value to 50, the Maximum value to 100, and the Major
Unit value to 5. Click the Close button.

Figure 3-25 shows the current appearance of the bubble plot.


Figure 3-25 
The initial 
bubble plot as an 
embedded chart 
object 



The default appearance of the bubble symbols makes it difficult to
separate one bubble from another. You can modify the plot symbols to make the
chart easier to read and interpret.

To format the bubble symbols: 

1 Click one of the bubbles in the chart to select all of the bubble 
symbols. 

2 Click the Format tab from the Chart Tools ribbon and then click the
Format Selection button from the Current Selection group.

3 Excel opens the Format Data Series dialog box. Click the Fill tab and
then click the Color button and select Yellow from the list of
standard colors.

4 Fill colors are, by default, solid; but you can allow them to become
partially transparent so that overlapping bubbles can be distinguished
from one another. Drag the Transparency slider located
below the Color button to the value 66%. Figure 3-26 shows the
modified fill color scheme for the bubbles.

Figure 3-26
Setting the fill
color for the
bubble symbols



5 You still need to specify a border color for the different bubbles so
that you can see where one bubble begins and the others end. Click
the Border Color tab and then click the Solid line option button.

6 Click the Close button and then deselect the chart. The revised chart
appears in Figure 3-27.

Figure 3-27
Revised
bubble plot

The area of each bubble in the plot is proportional to the percentage of
incoming freshman-athletes who graduated in the top 25% of their class. You
can change this so that the width of each bubble is proportional to that value.
In some situations, this works better in displaying differences between one
data point and another. You can also make the bubbles smaller so that there
is less overlap. Investigate the effect of changing the bubble symbol size on
the appearance of your chart.
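
The difference between sizing bubbles by area and sizing them by width can be worked through numerically. The following Python sketch uses a simple geometric model in which the largest value receives the largest bubble. It is an assumption-laden illustration and makes no claim about Excel's exact rendering; the function name bubble_diameter and its parameters are hypothetical.

```python
import math

def bubble_diameter(value, largest_value, mode="area",
                    scale_percent=100, max_diameter=1.0):
    """Diameter of a bubble relative to the largest one (illustrative model)."""
    if value < 0 or largest_value <= 0:
        raise ValueError("value must be >= 0 and largest_value must be > 0")
    fraction = value / largest_value
    if mode == "area":
        # Area proportional to value, so diameter grows with the square root.
        relative = math.sqrt(fraction)
    elif mode == "width":
        # Width (diameter) proportional to value.
        relative = fraction
    else:
        raise ValueError("mode must be 'area' or 'width'")
    return max_diameter * (scale_percent / 100) * relative

# A value one quarter as large as the largest value:
by_area = bubble_diameter(25, 100, mode="area")
by_width = bubble_diameter(25, 100, mode="width")
by_width_half = bubble_diameter(25, 100, mode="width", scale_percent=50)
print(by_area, by_width, by_width_half)
```

Under this model a value one quarter of the largest is drawn half as wide when area is proportional to the value, but only one quarter as wide when width is proportional to it. This is why width scaling makes differences between data points stand out more, and why a scale of 50% shrinks every bubble uniformly.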

To change the bubble size:

1 Click one of the bubbles in the chart to select all of the bubble
symbols again.

2 Return to the Format Data Series dialog box by clicking the Format
tab from the Chart Tools ribbon and then click the Format Selection
button from the Current Selection group.

3 If necessary click the Series Options tab and then click the Width
of bubbles option button and enter 50 in the Scale bubble size to
input box. This will reduce the width of the bubbles to 50% of their
default size (see Figure 3-28).

Figure 3-28
Setting the
width of the
plot bubbles



4 Click the Close button and then deselect the chart. The final
appearance of the bubble chart is shown in Figure 3-29.


Figure 3-29
Final
bubble plot

Let’s evaluate what we’ve created. In interpreting bubble plots, the
statistician looks for a pattern in the distribution of the bubbles. Are bubbles of
similar size all clustered in one area on the plot? Is there a progression in
the size of the bubbles? For example, do the bubbles increase in size as we
proceed from left to right across the plot? Is there a bubble that is markedly
different from the others? In this plot, we notice immediately that the smaller
bubbles seem to be clustered more toward the left end of the plot. This would
indicate that schools which have a lower percentage of incoming freshmen
that graduate in the top quarter of their class also tend to have a lower
ultimate graduation rate. However, it’s also interesting that the bubble
representing Minnesota is slightly larger than its surrounding bubbles, indicating that
we probably cannot argue that Minnesota’s lower graduation rate is due to a
lower number of incoming students who graduated in the top quarter of their
class. We would probably have to do further research to discover a reason
for Minnesota’s slightly lower graduation rate.


Breaking a Scatter Plot into Categories 

Bubble plots have the problem that it is not always easy to compare the
relative sizes of different bubbles, so another approach we can take is to
divide the universities into categories, plotting universities from different
categories with different symbols. For example, we can divide the
universities into two groups: one in which the percentage of freshmen who
graduated in the upper quarter of their class is less than 80% and another group
consisting of schools in which 80% or more of freshmen graduated in the
upper quarter of their high school class.
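
The two-group split described above can be sketched as a small filtering step. The following Python example is a hedged illustration rather than the StatPlus implementation; the function name split_by_threshold is hypothetical and every data row is an invented placeholder, since the text does not list the Top 25 rates.

```python
def split_by_threshold(rows, threshold=80):
    """Split (name, sat, grad, top25) rows into two series by Top 25 rate."""
    below = [row for row in rows if row[3] < threshold]
    at_or_above = [row for row in rows if row[3] >= threshold]
    return {
        f"<{threshold}%": below,
        f">={threshold}%": at_or_above,
    }

# Placeholder rows: (university, calculated SAT, graduation %, Top 25 rate %).
rows = [
    ("AAA", 1150, 60, 70),
    ("BBB", 1250, 80, 80),
    ("CCC", 1300, 85, 92),
]

series = split_by_threshold(rows)
for label, members in series.items():
    print(label, [name for name, *_ in members])
```

Each of the two lists would then be plotted as its own data series with its own symbol. A school sitting exactly at 80% falls in the second group, matching the "80% or more" wording.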

To do this with Excel, you have to copy the values for these universities
into two separate columns and then recreate the scatter plot, plotting two data
series instead of one. That can be a time-consuming process. To save time, you
can use StatPlus to break the scatter plot into categories for you. You’ll try this
now, using the Top 25 Rate column to determine the category values (<80% or
>=80%). First, you’ll make a copy of the scatter plot you created earlier.


Figure 3-30 
Moving or 
copying a 
chart sheet 


To copy the scatter plot: 

1 Right-click the Graduation Chart sheet tab and select Move or Copy 
from the pop-up menu that appears. Excel opens the Move or Copy 
dialog box. 

2 Click the Create a copy checkbox and select Bubble Chart from the 
list of chart sheets as shown in Figure 3-30. 



3 Click the OK button. 

Excel inserts a new chart sheet named Graduation Chart (2) directly 
before the Bubble Chart chart sheet. 

4 Click any one of the plot symbol labels to select them all and then
press the Delete key.

The plot labels disappear (you won’t be using them in the plot you’ll
create next). Now, break the points in the scatter plot into two categories on the
basis of the values in the Top 25 Rate column.


To break the scatter plot into categories: 

1 Click Display series by category from the StatPlus menu located in 
the Menu Commands group on the Add-Ins tab. 

2 Click the Categories button. 

3 Verify that the Use Range Names option button is selected, click 
Top_25_Rate from the list of range names, and click OK. 

4 Click the Bottom option button to display the categories’ legend at 
the bottom of the scatter plot. 

5 Click the OK button. Figure 3-31 displays the scatter plot broken 
down by categories. 


Figure 3-31
Breaking
the scatter
plot into
categories

STATPLUS TIPS

• Once you break a scatter plot into categories, you cannot go back
to the original scatter plot (without the categories). If this is a
problem, create a copy of the scatter plot before you break it down.

• If your chart contains several data series, you can choose which
series to break down into categories by selecting the series name
from the Select a Series drop-down list box in the Display by
Series Category dialog box.


With the data series now broken down into categories, we can compare 
the universities’ graduation rates on the basis of the high school 
performance of their incoming freshmen. On the basis of the chart, we 
quickly see that schools in which 80% or more of the incoming freshmen 
graduated in the upper quarter of their high school class have higher 
graduation percentages, whereas schools in which less than 80% of 
incoming freshmen graduated in the upper quarter see a lower percentage 
of those students graduate. 

So we have two predictors of graduation percentage. One is the calculated 
SAT score, and the other is the percentage of incoming freshmen who graduated 
in the upper 25% of their high school class. Notice, however, that within each 
category (<80% and ≥80%), there does not seem to be a clear trend based on 
calculated SAT values. 
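
For readers who want to see the idea outside of Excel, here is a minimal Python sketch of what breaking a data series into categories amounts to. It is not part of Excel or StatPlus. The six data rows are made up for illustration (they are not the Big Ten values), and the function name split_by_threshold and its 80% default are assumptions chosen to mirror the two categories discussed above.

```python
from statistics import mean

# Hypothetical rows: (calculated SAT, % of freshmen from top 25% of class, graduation %).
# These numbers are invented for illustration; they are NOT the Big Ten data.
schools = [
    (1050, 62, 58),
    (1100, 71, 64),
    (1150, 78, 66),
    (1180, 85, 80),
    (1240, 90, 83),
    (1300, 95, 86),
]


def split_by_threshold(rows, threshold=80):
    """Split rows into '<80%' and '>=80%' groups using the top-25% rate."""
    groups = {"<80%": [], ">=80%": []}
    for sat, top25, grad in rows:
        key = ">=80%" if top25 >= threshold else "<80%"
        groups[key].append((sat, grad))
    return groups


groups = split_by_threshold(schools)

# Average graduation percentage within each category.
mean_grad = {name: mean(grad for _, grad in points) for name, points in groups.items()}

for name, points in groups.items():
    print(name, points, round(mean_grad[name], 2))
```

Each group corresponds to one data series in the categorized chart. Comparing the group averages gives a rough numerical counterpart to what the chart shows visually.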

Essentially both measures are telling us the same general thing: 
universities that attract better student athletes will also have higher 
graduation percentages. That’s not surprising. What might be useful, however, is 
determining whether the calculated SAT score or the high school class 
ranking is the better predictor. That is not something we can easily 
determine from the chart. To answer that question, we have to perform a 
statistical analysis of the data, which we’ll do in a future chapter. 

There are other factors that we haven’t investigated. Does the size or 
the location of the school matter? Does it make a difference if the university 
is a public or private institution? And we have to be aware that we are only 
looking at a sample of 11 universities; we don’t know if any of our 
conclusions can apply to another sample of schools. All of these are questions for 
future study. 


Plotting Several Variables 

Before finishing, let’s explore one more question. The data include 
graduation rates for the athletes in the freshman class broken down by gender 
and race. How do these graduation rates compare? A scatter plot displaying 
these results needs to have several data series. You can create such a scatter 
plot by simply selecting additional columns of data to be plotted on the 
y axis of the chart. In this example you’ll plot the graduation rates for white 
male and female athletes. 

To create the scatter plot: 

1 Return to the Grad Percents worksheet and select the nonadjacent 
cell range D1:D12;F1:F12;H1:H12. 

2 Click the Scatter button from the Charts group on the Insert tab and 
then click the first Scatter subtype, displaying a scatter plot with 
only markers. Excel inserts an embedded scatter plot as shown in 
Figure 3-32. 

Figure 3-32 Plotting the white male and female graduation rates 



3 Move the scatter plot to its own chart sheet named Graduation Chart 
by Gender. 

4 Excel does not automatically add a chart title when there is more 
than one data series being plotted. To insert a title, click the Chart 
Title button located in the Labels group of the Layout tab on the 
Chart Tools ribbon and then click Above Chart. Enter Graduation 
Percentage by Gender for the title. 

5 As you did for the other scatter plots, change the title of the x axis to 
Calculated SAT Values and the title of the y axis to Graduation 
Percentages. Set the font size of the titles to 14 points and 16 points, respectively. 

6 Change the scale of the x axis to range from 1000 to 1500 in intervals 
of 50 points. Change the scale of the y axis to range from 50 to 100 in 
intervals of 5 points. 

7 Click the Gridlines button from the Axes group on the Layout tab 
of the Chart Tools ribbon and then click Primary Vertical Gridlines 
and Major Gridlines to add gridlines to the chart. Figure 3-33 shows 
the final formatted version of the scatter plot. 


Figure 3-33 Breaking the scatter plot into categories 

The scatter plot shows that the white female athletes generally have higher 
graduation rates than the white male athletes. Does this tell us something 
about white female and male athletes? Perhaps, but we should bear in mind 
that this chart plots the average graduation rates for these two groups against 
the average SAT score for the entire class of incoming freshmen. We don’t 
have data on the average SAT score for incoming freshman male athletes or 
incoming freshman female athletes. It’s possible that the female athletes also 
had higher SAT scores than their male counterparts, and thus we would 
expect them to have higher graduation rates. On the other hand, if their SAT 
scores are comparable, we might look at the college experiences of male and 
female athletes at these universities to see whether this would have an effect 
on graduation rates. Are the demands on male athletes different from those 
on female athletes, and does this affect the graduation rates? 

STATPLUS TIPS 

• You can quickly create your own scatter plots using StatPlus. To 
create a scatter plot with the add-in, click Single Variable Charts 
and then Fast Scatterplot from the StatPlus menu. 

• In the same way you can quickly create your own bubble plots 
using StatPlus. To create a bubble plot, click Multi-Variable 
Charts and then Fast Bubble Plot from the StatPlus menu. 


You’ve completed your work on the Big Ten scatter plots. You can now 
close the workbook, saving your changes. 


Exercises 

1. You decide to investigate whether there 
is any relationship between the graduation 
rate for student athletes and the 
race of the athlete. To do this, open the 
Big Ten workbook you examined in 
this chapter and perform the following 
analyses: 

a. Open the Big Ten workbook from the 
Chapter03 folder and save it as Big 
Ten Graduation by Race. 

b. Create a scatter plot with the calculated 
SAT score on the x axis and the 
white male and black male graduation 
rates on the y axis. Title the chart 
Graduation Percentages by Race. 
Title the y axis Graduation Percentages 
and the x axis Calculated SAT 
Values. Move your scatter plot to a 
chart sheet named Male Graduation 
by Race. 

c. Edit the scale of the x and y axes to 
reflect the range of data values. 

d. Add labels to the scatter plot, identifying 
each university. Compare the 
black male graduation rate for each 
university to the corresponding white 
male graduation rate. What do you 
observe? 


e. Repeat your analysis, this time 
comparing female graduation rates 
by race. Save your graph on a chart 
sheet named Female Graduation 
by Race. 

f. Save your workbook and write 
a brief report summarizing your 
observations. 

2. In the 1980s female professors at a 
junior college sought help from statisticians 
to show that they were underpaid 
relative to their male counterparts. The 
legal action was eventually settled out 
of court. Investigate their claim by creating 
scatter plots of the salary data they 
acquired for their case. 

a. Open the Junior College workbook 
from the Chapter03 folder and save it 
as Junior College Salary Charts. 

b. Create a scatter plot with Salary on 
the y axis and Years on the x axis. 
Title the scatter plot Employee 
Salaries. Title the y axis Salary and 
the x axis Years Employed. Remove 
the legend and gridlines from the 
plot. Save the plot as a chart sheet 
named Salary by Gender Chart. 

c. Break the scatter plot points into 
two categories on the basis of gender. 
Does the plot suggest that male 
salaries tend to be higher than female 
salaries for comparable years of 
employment? 

d. Examine the list of other variables 
in the workbook. Are there other 
variables in that list that should be 
taken into account before coming to 
a conclusion about the relationship 
between gender and salary? 

e. Save your workbook and write 
a report summarizing your 
observations. 

3. Admission decisions to colleges 
are often partly based on ACT math 
scores and high school rank with the 
expectation that these scores are related 
to success in college. Is this always the 
case and is gender a factor? You’ve been 
provided with a data set to investigate 
this question. The data set contains 
columns for gender, high school rank 
(HS Rank), American College Testing 
Mathematics test score (ACT Math), and 
an algebra placement test score (Alg 
Place) from the first week of class and 
the final first semester calculus grades 
(Calc) for a group of students [see Edge 
and Friedberg (1984)]. Graph the data to 
investigate what kind of relationships 
appear to exist between the variables. 

a. Open the Calculus workbook from 
the Chapter03 folder and save it as 
Calculus ACT Charts. 

b. Create a scatter plot, on a separate 
chart sheet, plotting Calc on the y axis 
and ACT Math on the x axis. Label 
the axes appropriately. How strong 
does the relationship between the 
ACT Math score and the first semester 
calculus score appear to you? 

c. Break down the scatter plot by gender. 
Is there evidence of a difference in 
calculus scores based on gender? 


d. Repeat steps a through c for a scatter 
plot relating calculus grades to the 
algebra placement test. 

e. Save your workbook and then write 
a report summarizing your findings. 

4. You’ve been given a data set containing 
the mass and volume measurements 
from eight chunks of aluminum as 
recorded in a high school chemistry lab. 
Graph and examine their findings. 

a. Open the Aluminum workbook from 
the Chapter03 folder and save it as 
Aluminum Chart. 

b. Create a scatter plot with mass values 
on the y axis and volume values on 
the x axis. Add major gridlines for 
both the x axis and the y axis. 

c. Examine your chart. There should be 
a data value that appears out of place. 
Mark this point by changing the plot 
symbol used for that point to a different 
color from the rest of the points. 

d. Do the other points seem to form 
a nearly straight line? The ratio of 
mass to volume is supposed to be a 
constant (the density of aluminum), 
so the points should fall on a line 
through the origin. Draw the line, 
and estimate the slope (the ratio of 
the vertical change to the horizontal 
change) along the line. What is your 
estimate for the density of aluminum? 

e. Save your workbook and write a 
report summarizing your findings. 

5. You’ve been asked to investigate the 
relationship between protein and carbohydrates 
in several brands of wheat cereals 
and breads. Data taken from a trip to 
a local grocery store has been recorded 
and saved for you. 

a. Open the Wheat workbook from the 
Chapter03 folder and save it as Wheat 
Charts. 

b. Create a scatter plot with Protein on 
the y axis and Carbo on the x axis. 
Label the axes appropriately. Save the 
chart into a chart sheet named Protein 
Chart. 

c. Label each point in the scatter plot 
with its food name. 

d. Create a bubble plot of protein versus 
carbohydrates with the size of the 
bubble determined by the amount 
of sugar contained in each product. 
Format the chart to make it easy to 
read and interpret. Which plot points 
show the largest bubbles (and thus the 
largest sugar content) and what food 
do they represent? 

e. Sugar does not contain protein. 
Explain how this fact is reflected in 
the placement of the two food products 
with the largest sugar content in 
the bubble plot you just created. 

f. Save your workbook and write a 
report summarizing your observations 
from the graph you created. 

6. How well does a player’s salary match 
up with his career batting average? 
You’ve been given performance and 
salary data of major league players from 
the beginning of the 2007 season 
(nonpitchers only). Analyze the relationship 
between performance and pay. 

a. Open the Baseball workbook from 
the Chapter03 folder and save it as 
Baseball Salaries Chart. 

b. Create a scatter plot with salary on 
the y axis and career batting average 
(AVG) on the x axis. Label the axes 
appropriately and save the plot in a 
chart sheet named Salary Chart 
vs. AVG. 

c. Change the lower range of the x axis 
scale to 0.15. 

d. Identify on the plot the last name of 
the player with the highest salary. 
How does this player’s batting average 
compare to the other players in the 
sample? 


e. Salary values can vary in magnitude 
from around 200,000 up to more 
than 20,000,000. You can adjust for 
this scatter by plotting the data on a 
logarithmic chart. Copy your chart 
sheet to a new sheet named Salary Log 
Chart and then change the property of 
the vertical axis to show the axis with a 
log scale over a range from 300,000 up 
to 30,000,000. How does the log scale 
affect the vertical scatter of the data? 

f. Create a scatter plot of salary vs. home 
runs (HR). Save the chart on a chart 
sheet named Salary Chart vs. HRs. 
Compare this chart to the one you 
created in the Salary Chart vs. AVG 
sheet. Which chart shows the stronger 
relationship? Which do owners 
appear to value more: batting average 
or homeruns? 

g. There is a player in your chart who 
appears to be underpaid for the 
number of homeruns hit. Identify this 
player. Examine how many seasons 
the player has been in the league. 
Explain how this could have affected 
his salary value. 

h. Create a bubble plot of salary vs. batting 
average on a new chart sheet 
named Salary vs. Avg Bubble Chart, 
using the values of the HR column 
to determine the size of each bubble. 
(Note: Due to the order of the columns 
in the Player Data worksheet, you can 
more easily create the bubble plot 
using the StatPlus Fast Bubble Plot 
command available under the 
Multi-Variable Charts menu.) Set the scale of 
the x-axis to range from 0.15 to 0.35 
in intervals of 0.05. Display the y-axis 
on a log scale ranging from 300,000 to 
30,000,000. Make the bubble symbols 
partly transparent with the width of 
the bubble scaled down to 25. Based 
on your bubble plot, where are the largest 
bubbles (and thus the players with 
the most home runs) concentrated? 

i. There appears to be a player whose 
reported salary is lower than expected 
for his combination of batting average 
and homeruns. Identify that player. 

j. Save your workbook and write a 
report summarizing your observations. 
Which is more important 
in determining the player salary: 
homeruns or batting average? 

7. Working at the Bureau of Labor 
Statistics, James Longley tracked several 
variables related to the United States 
economy from 1947 to 1962. Study the 
data he collected. 

a. Open the Longley workbook from the 
Chapter03 folder and save the workbook 
as Longley Graph. 

b. The workbook has seven columns 
related to the economy. The Total column 
displays the total U.S. 
employment in thousands. The Armforce 
column displays the total number 
of people in the armed forces, listed 
again in thousands. Create a scatter 
plot of Total versus Armforce. Label 
and scale the chart appropriately. 

c. Add labels to each plot point indicating 
the year in which the datum was 
recorded. 

d. Four points on the lower left of the 
scatter plot stand out. Examine the 
economic and political history of the 
era to explain why these values are so 
distinct from all of the others. 

e. Aside from the four points in the 
lower left corner of the plot, describe 
the general relationship between 
total employment and the number of 
people in the armed forces. 

f. Save your workbook and write 
a report summarizing your 
observations. 

8. What is the relationship between race 
results and reaction time (the time it 
takes for the runner to leave the starting 
block after hearing the sound of the 
starting gun)? You’ve been given a workbook 
containing the race results and 
reaction times from the first round of 
the 100-meter heats at the 1996 Summer 
Olympic games in Atlanta. Graph the 
data to investigate the effect that reaction 
time has on race results. 

a. Open the Race workbook from the 
Chapter03 folder and save it as Race 
Graphs. 

b. Create a scatter plot of race time 
versus reaction time. Label and scale 
the chart appropriately. Do you see a 
trend that would indicate that runners 
with faster reaction times have faster 
race times? 

c. There is a point that lies away from 
the others. Identify the runner corresponding 
to this point. 

d. Copy your chart to another chart sheet 
and then rescale the axes, setting the 
x axis range to 0.12 to 0.24 seconds 
and the y axis scale to 9.5 to 12.5 
seconds. Is there any more indication 
that a relationship exists between 
reaction time and race time? 

e. Save your workbook and write a 
summary including a comment on 
how the scale used in plotting data 
can affect your perception of the 
results. 

9. The Cars workbook contains information 
on car models from Consumer 
Reports®, 2003-2008. Data in the 
workbook include the miles per gallon 
(MPG) of each car as well as the time to 
accelerate from 0 to 60, weight, 
horsepower, price, etc. See Exercise 10 of 
Chapter 2. 

a. Open the Cars workbook from the 
Chapter03 folder and save it as Car 
Graphs. 

b. Create a scatter plot on a separate 
chart sheet of MPG (on the y-axis) 
versus horsepower (on the x-axis). 

c. Label the points that are highest 
in MPG as measured by the height 
above the average MPG for those with 
around the same horsepower. Label 
each of these points with the Model 
(make sure that you select only these 
points rather than all of the points, 
or else you’ll wait a long time for all 
of the labels to be added to the scatter 
plot). What do these cars have in 
common? (Hint: Each does not have just 
a traditional gasoline engine.) Print 
your chart. 

d. Copy your scatter plot to a second 
chart sheet. Break down the plot 
by company region. Do you see a 
relationship between Region and 
MPG? Which region has lowest 
MPG, on the average, for each given 
horsepower? 

e. Create a bubble plot on a separate 
chart sheet with MPG on the y-axis, 
horsepower on the x-axis and the size 
of each bubble determined by the 
price of the car. 

f. Rescale the bubbles to 50% of the 
default and relate the size of the price 
column to the width of the bubbles. 
Print the chart. 

g. Is the price of the car related to the 
horsepower and the gas mileage? 
How? Describe the relationship and 
explain why it should be expected. 
Print the rescaled chart. Save your 
changes to the workbook. 

10. Voting results for two presidential elections 
have been recorded for you in an 
Excel workbook. The workbook contains 
the percentage of the vote won by the 
Democratic candidate in 1980 and 1984, 
broken down by state. Graph and analyze 
the results. 

a. Open the Voting workbook from the 
Chapter03 folder and save it as Voting 
Graphs. 

b. Create a scatter plot of Dem1984 
versus Dem1980 on a separate chart 
sheet. 

c. Rescale the axes so that the minimum 
value for the x-axis and the y-axis is 
20 and the maximum value is 60. 

d. Examine the scatter plot. Does the 
voting percentage in 1984 generally 
follow the voting percentage from 
1980? In other words, if the Democratic 
candidate received a large percentage 
of the vote from a particular 
state in 1980, did he or she do as well 
in 1984? 

e. In one state, the candidate had a large 
percentage of the vote in 1980 (above 
55%) but a small percentage of the vote 
in 1984 (about 40%). Identify this state. 

f. Create a copy of the scatter plot on a 
separate chart sheet. Break down this 
new scatter plot by region. 

g. Examine the location of the southern 
states in the scatter plot. Do they follow 
the general pattern shown by 
the other points in the plot? Interpret 
your answer in light of what you 
know of the 1980 and 1984 elections. 
(Hint: Consider whether the fact that the 
1980 election involved a southern 
Democratic candidate and the 1984 election 
involved a midwestern Democratic 
candidate caused a change in the voting 
percentages of the southern states.) 

h. Save your workbook and write a 
report summarizing your conclusions. 

Chapter 4 


Describing Your Data 



In this chapter you will learn: 

• About different types of variables 

• How to create tables of frequency, cumulative frequency, percentages, 
and cumulative percentages 

• How to create histograms and break histograms down by groups 

• About creating and interpreting stem and leaf plots 

• How to calculate descriptive statistics for your data 

• How to create and interpret box plots 

Chapter 4 introduces the different tools that statisticians use to 
describe and summarize the values in a data set. You’ll work with 
frequency tables in order to see the range of values in your data. 
You’ll use graphical tools like histograms, stem and leaf plots, and 
box plots to get a visual picture of how the data values are distributed. 
You’ll learn about descriptive statistics that reduce the contents of your 
data to a few values, such as the mean and standard deviation. Applying 
these tools is the first step in the process of evaluating and interpreting 
the contents of your data set. 


Variables and Descriptive Statistics 

In this chapter you’ll learn about a branch of statistics called descriptive 
statistics. In descriptive statistics we use various mathematical tools to 
summarize the values of a data set. Our goal is to take data that may 
contain thousands of observations and reduce it to a few calculated values. For 
example, we might calculate the average salaries of employees at several 
companies in order to get a general impression about which companies pay 
the most, or we might calculate the range of salaries at those companies to 
convey the same idea. 

Note that we should be very careful in drawing any general conclusions 
or making any predictions on the basis of our descriptive statistics. Those 
tasks belong to a different branch of statistics called inferential statistics, 
a topic we’ll discuss in later chapters. The goal of descriptive statistics is 
to describe the contents of a specific data set, and we don’t have the tools 
yet to evaluate any conclusions that might arise from examining those 
statistics. 

When descriptive statistics involve only a single variable, as they will 
in this chapter, we are employing a branch of statistics called univariate 
statistics. Now we’ve used the term variable several times in this book. 
What is a variable? 

A variable is a single characteristic of any object or event. In the last 
chapter, you looked at data sets that contained several variables describing 
graduation rates of the Big Ten universities. Each column in that worksheet 
contained information on one characteristic, such as the university’s name 
or total enrollment, and thus was a single variable. 

Variables can be classified as quantitative and qualitative. Quantitative 
variables involve values that come in meaningful (not arbitrary) 
numbers. Examples of quantitative variables include age, weight, and annual 
income: anything that can be measured in terms of a number. The number 
itself can be either discrete or continuous. Discrete variables are 
quantitative variables that assume values from a defined list of numbers. The 
numbers on a die come in discrete values (1, 2, 3, 4, 5, or 6). The number of 
children in a household is discrete, consisting of positive integers and zero. 
Continuous variables, on the other hand, can take any value within a 
continuous range of possible values. An individual’s weight could be 185, 185.5, 
or 185.5627 pounds. To be sure, there is some blurring in the distinction between 
discrete and continuous variables. Is salary a discrete variable or a 
continuous variable? From one point of view, it’s discrete: The values are limited 
to dollars and cents, and there is a practical upper limit to how high a 
specific salary could go. However, it’s more natural to think of salary as 
continuous. 

The second type of variable is the qualitative variable. Qualitative or 
categorical variables are variables whose values fall into some category, 
indicating a quality or property of an object. Gender, ethnicity, and 
product name are all examples of qualitative variables. Qualitative variables are 
generally expressed in text strings, but not always. Sometimes a 
qualitative variable will be coded using numerical values. A common “gotcha” 
for people new to statistics is to analyze these coded values as quantitative 
variables. Consider the qualitative data values from Table 4-1. 


Table 4-1 Qualitative Variables 

ID          Gender (0 = male;   Ethnicity (0 = Caucasian; 1 = African 
            1 = female)         American; 2 = Asian; 3 = Other) 
3458924065  1                   0 
4891029494  0                   3 
3489109294  0                   1 


Now all of these values were entered as numbers, but does it make sense 
to say that the average gender is 1? Or that the sum of the ethnicities is 4? 
Of course not, but if you’re not careful, you may find yourself doing things 
like that in other, more subtle cases. The point is that you should always 
understand what type of variables your data set contains before applying any 
descriptive statistic. 
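
The same pitfall can be shown outside of Excel. The short Python sketch below (an illustration only, not something the workbook or StatPlus provides) uses the three Gender and Ethnicity codes from Table 4-1. The variable names are assumptions made for the example. Adding up the codes runs without error but produces a meaningless number, whereas translating the codes back to labels and counting them gives a sensible summary.

```python
from collections import Counter

# Code books taken from the column headings of Table 4-1.
GENDER_LABELS = {0: "male", 1: "female"}
ETHNICITY_LABELS = {0: "Caucasian", 1: "African American", 2: "Asian", 3: "Other"}

# The three coded rows of Table 4-1.
gender_codes = [1, 0, 0]
ethnicity_codes = [0, 3, 1]

# Arithmetic on the codes "works", but the result has no meaning:
# the codes are arbitrary labels, not measurements.
meaningless_sum = sum(ethnicity_codes)

# A sensible summary for a qualitative variable: count each category.
gender_counts = Counter(GENDER_LABELS[code] for code in gender_codes)
ethnicity_counts = Counter(ETHNICITY_LABELS[code] for code in ethnicity_codes)

print("Sum of ethnicity codes (meaningless):", meaningless_sum)
print("Gender counts:", dict(gender_counts))
print("Ethnicity counts:", dict(ethnicity_counts))
```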

Qualitative variables can be classified as ordinal and nominal. An 
ordinal variable is a qualitative variable whose categories can be put into 
some natural order. For example, users asked to fill out a survey ranking 
their product satisfaction may enter values from “Not satisfied” all the way 
up to “Extremely satisfied.” These values are categorical values, but they 
have a clear order of ascendancy. Nominal variables are qualitative 
variables without any such natural order. Ethnicity, state of residence, and 
gender are all examples of nominal variables. Table 4-2 summarizes properties 
of the different types of variables we’ve been discussing. 

Table 4-2 Summary of variable properties 

Quantitative    Variables whose values come in meaningful (not arbitrary) numbers 
  Discrete      Quantitative variables whose values derive from a list of 
                specific numbers 
  Continuous    Quantitative variables that can assume values from within a 
                continuous range of possible values 
Qualitative     Variables whose values fall into categories 
  Ordinal       Qualitative variables whose categories can be assigned some 
                natural order 
  Nominal       Qualitative variables whose categories cannot be put into any 
                natural order 


In this chapter, we’ll work primarily with continuous quantitative 
variables. You’ll learn how to work with qualitative variables in Chapter 7, 
where you work with tables and categorical data. 


Frequency Tables 

You’ve been asked to analyze the history of housing prices in the Southwest 
and have been given a random sample from the records of resales of homes in 
Albuquerque, New Mexico, from 2/15/1993 through 4/30/1993 (Albuquerque 
Board of Realtors). The variables in this data set have been placed in an Excel 
workbook with the range names and descriptions shown in Table 4-3. 


Table 4-3 Housing Data Set 

Range Name     Range     Description 
Price          A2:A118   The selling price of each home 
Square_Feet    B2:B118   The square footage of the home 
Age            C2:C118   The age of the home in years 
Features       D2:D118   The number of features available in the home 
                         (dishwasher, refrigerator, microwave, disposal, 
                         washer, intercom, skylight(s), compactor, dryer, 
                         handicapped-accessible, cable TV access) 
NE_Sector      E2:E118   Located in the northeast sector of the city (Yes or No) 
Corner_Lot     F2:F118   Located on a corner lot (Yes or No) 
Offer_Pending  G2:G118   Offer pending on the home (Yes or No) 
Annual_Tax     H2:H118   Estimated annual tax paid on the home 

To view the Housing workbook: 

1 Start Excel. 

2 Open the Housing workbook from the Chapter04 folder. 
The workbook appears as shown in Figure 4-1. 


Figure 4-1 The Housing workbook 

3 Save the workbook as Housing Statistics. 


Creating a Frequency Table 

One of the first things we’ll examine when studying this data set is the 
distribution of its values. The distribution is the spread of the data across a 
range of possible values. If you were thinking about moving to Albuquerque 
during the time these data were recorded, you might be interested in the 
distribution of home prices in the area. What is the range of housing prices 
in the area? What percentage of houses list for under $125,000? 

As a first step in answering these types of questions, we’ll create a 
frequency table of the home prices. A frequency table is a table that tabulates 
the number of occurrences, or counts, of each value of a given variable. 
Excel does not have a built-in command to create such a frequency table, 
but you can use the one supplied with the StatPlus add-in. Use StatPlus 
now to create a frequency table of the home prices in the Housing Statistics 
workbook. 


To create a frequency table of home prices: 

1 Click Descriptive Statistics from the StatPlus menu on the Add-Ins 
tab and then click Frequency Tables. 

2 Click the Data Values button, click the Use Range Names option 
button, and click Price. Click the OK button. 

The Frequency Table command gives you three options for 
organizing your table. You can use discrete values so that the table is 
tabulated over individual price values, or you can organize the values 
into bins (you’ll learn about bins shortly). For now, leave Discrete as 
the selected option. 

3 Click the Output button, click the New Worksheet option button, 
and type Price Table in the New Worksheet name box. Click the OK 
button. 

4 Click OK to start generating the frequency table. Figure 4-2 displays 
the completed table. 


Figure 4-2 Frequency of housing prices 

The table contains five columns. The first column, Price, lists in 
ascending order each home price in the sample of 117 homes. Prices in this sample 
range from a minimum of $54,000 to a maximum of $215,000. The second 
column, Freq, counts the frequency, or number of occurrences, for each 
value in the price column. Many prices are unique and have frequencies 
of 1, but other prices (such as $75,000) occur for multiple homes. The third 
column contains the cumulative frequency, counting the total number of 
homes at or below a given price. By examining the table, you can quickly 
see that 24 of the homes in the sample have a price of $75,000 or less. The 
fourth column lists the percentage occurrence of each home price out of 
the total sample. For example, 1.71% of the homes are listed for exactly 
$75,000. Finally, the fifth and last column of the table calculates the 
cumulative percentage for the home prices. In this case, 24.79% of the homes 
(almost one-quarter) list for $77,300 or less. A table of this kind can help 
you in evaluating the market. For example, if you were interested in homes 
that list for $125,000 or less, you could quickly determine that almost 80% 
of the homes in this database, or 93 different listings, met that criterion. 
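
To make the five columns concrete, here is a minimal Python sketch that builds the same kind of table from a list of values. It is only an illustration of the arithmetic; it is not how StatPlus is implemented. The function name frequency_table and the six sample prices are assumptions made for the example (the prices are not the 117 homes in the Housing workbook).

```python
from collections import Counter


def frequency_table(values):
    """Return rows of (value, freq, cumulative freq, percent, cumulative percent)."""
    counts = Counter(values)
    total = len(values)
    rows = []
    running = 0
    for value in sorted(counts):          # ascending order, as in the Price column
        freq = counts[value]
        running += freq                   # cumulative frequency so far
        rows.append((value, freq, running,
                     100 * freq / total, 100 * running / total))
    return rows


# Hypothetical sample of six prices, chosen only to keep the output short.
prices = [54000, 75000, 75000, 77300, 125000, 215000]
table = frequency_table(prices)

print(f"{'Price':>8} {'Freq':>5} {'Cum Freq':>9} {'Percent':>8} {'Cum Pct':>8}")
for value, freq, cum, pct, cum_pct in table:
    print(f"{value:>8} {freq:>5} {cum:>9} {pct:>8.2f} {cum_pct:>8.2f}")
```

The last row of any such table should show a cumulative frequency equal to the sample size and a cumulative percentage of 100, which is a quick check that no values were dropped.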

EXCEL TIPS 

• If you don’t have StatPlus handy, Excel comes with an add-in 
called the Data Analysis ToolPak which you can use to create 
a frequency table. The ToolPak does not have all the frequency 
table options that StatPlus contains. 

• If you want to count how many values in a column are equal to a 
specific value, you can use Excel’s COUNTIF function. 

• You can also create a frequency table using Excel’s FREQUENCY 
function. This function uses Excel’s array feature, which you can 
learn about by using the online Help. 


Using Bins in a Frequency Table 

By creating a frequency table, you got a clear picture of the distribution of prices
in the Albuquerque area back in 1993. However, displaying individual values
would be cumbersome if the sample contained 1,000 or 10,000 observations.

Rather than list individual prices, you can have the frequency table group the
values by placing them in bins, where each bin covers a particular range of
values. The frequency table would then count the number of values that fall in each
bin. There are three ways of counting values in bins, as shown in Figure 4-3.

1. Count those values which are >= the bin value and < the next bin value.

2. Count those values which are centered around the bin value (in the case
of mid-point values, start counting from the lower mid-point).

3. Count those values that are <= the bin value but > the previous bin value.

Figure 4-3 Counting within a bin
[Three panels: counted values are >= bin value and < next bin value; counted
values are centered around the bin value (mid-points below the bin value are
counted); counted values are <= bin value and > previous bin value]

To interpret a frequency table that involves bins correctly, you need to
know which of these methods is used in calculating the counts. We’ll create
another frequency table of the housing prices in the workbook, this time
breaking the data down into 15 equally spaced bins.
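The difference between the first and third counting methods matters only for values that fall exactly on a bin value. The following Python sketch illustrates this; the function `count_in_bins` and its handling of values outside the bin range are assumptions made for illustration, not a description of how StatPlus works internally. The centered method is omitted for brevity.

```python
from bisect import bisect_left, bisect_right

def count_in_bins(values, bins, method="left"):
    """Count values per bin; `bins` must be sorted in ascending order."""
    counts = [0] * len(bins)
    for v in values:
        if method == "left":      # bin value <= v < next bin value
            i = bisect_right(bins, v) - 1
        elif method == "right":   # previous bin value < v <= bin value
            i = bisect_left(bins, v)
        else:
            raise ValueError("method must be 'left' or 'right'")
        # Assumption: the last "left" bin and the first "right" bin are
        # open-ended; values falling outside every bin are skipped.
        if 0 <= i < len(bins):
            counts[i] += 1
    return counts

bins = [50000, 60000, 70000]
values = [50000, 55000, 60000, 65000, 70000]
print(count_in_bins(values, bins, "left"))   # 60000 is counted in the 60000 bin
print(count_in_bins(values, bins, "right"))  # 55000 is counted in the 60000 bin
```

The same five values produce different counts under the two methods, which is why you need to know the method before reading a binned table.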

To create a frequency table with bins: 

1 Click Descriptive Statistics from the StatPlus menu and then click
Frequency Tables.

2 Click the Data Values button and select Price as the data variable.
Click OK.

3 Click the Create 15 equally spaced bins option button.

Note that the first option button has been selected, so that counts will
be calculated for values that are >= the bin value and < the succeeding
bin value.

4 Click the Output button and click the New Worksheet option button.
Type Price Table with Bins as the worksheet name and click the
OK button.

5 Click the OK button to start generating the frequency table with bins.
Figure 4-4 shows the resulting frequency table.

Figure 4-4 Frequency table with equally spaced bins

This frequency table gives us a somewhat clearer picture of the distribution of
housing prices back in 1993. Note that almost 80% of the prices are clustered
within the first seven bins of the table (representing homes costing
about $129,000 or less). Moreover, there are relatively few homes in the
$160,000-$200,000 price range (only about 4% of the sample). There is,
however, a small group of homes priced above $205,000.


Defining Your Own Bin Values 

The bin values shown in Figure 4-4 were generated by dividing the range of 
prices into 15 equally spaced intervals. This resulted in cutoff values like 
64,733 and 75,467. However, in an analysis of pricing we are usually more 
interested in even cutoff values like 60,000 and 70,000. The StatPlus
Frequency Table dialog box allows you to specify your own bin values in place
of automatically generated ones. Try this now, by creating a frequency table 
of housing prices in $10,000 increments, starting at $50,000. You will first 
have to enter the bin values into cells in the workbook. 
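The two sets of bin values discussed above can be reproduced with a few lines of Python. This is a sketch under one stated assumption: that the 15 automatic bins start at the sample minimum ($54,000) and span the range up to the maximum ($215,000). The text does not say this explicitly, but the assumption reproduces the cutoffs 64,733 and 75,467 quoted above. The function name is hypothetical.

```python
def equally_spaced_bins(lowest, highest, n_bins):
    """Return n_bins bin values starting at `lowest` with equal widths."""
    width = (highest - lowest) / n_bins
    return [lowest + i * width for i in range(n_bins)]

# Automatically generated bins (assumed to start at the sample minimum).
auto_bins = equally_spaced_bins(54000, 215000, 15)
print([round(b) for b in auto_bins[:3]])

# User-defined bins: $10,000 increments from $50,000 to $230,000.
custom_bins = list(range(50000, 240000, 10000))
print(custom_bins[0], custom_bins[-1], len(custom_bins))
```

The user-defined list has 19 values, which is why the steps that follow fill the range G2:G20.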

To create your own bin values:

1 Click cell G1, type Price, and press Enter.

2 In cell G2 type 50,000 and press Enter. Type 60,000 in cell G3 and
press Enter.

3 Select the range G2:G3, drag the fill handle down to cell G20, and
release the mouse button.

The values 50,000-230,000 should now be entered into the cell
range G2:G20.

4 Click Descriptive Statistics from the StatPlus menu and then click
Frequency Tables.

5 Click the Data Values button and select Price as the data variable.
Click the Bin Values button.

6 Click the Use Range References option button, select the range
G1:G20, and then click OK.

7 Click the <= bin and > previous bin option button to control how
the bin counts are determined.

8 Click the Output button, click the Cells option button, and select
cell G1. Click the OK button.

9 Click the OK button to start generating the frequency table with your
customized bin values. Figure 4-5 displays the new frequency table.

Figure 4-5 Frequency table with user-defined bin values

This table is a lot easier to interpret than that of Figure 4-4. Looking at the
table, it’s easy to discover that there were only two houses in the sample in
the $140,000-$150,000 price range.

STATPLUS TIPS

• You can use the Frequency Table command to create tables that
are broken down into categories on the basis of a qualitative
variable. To do so, click the By button in the Frequency Table
dialog box and choose the range name or range reference
containing the values of the qualitative variable.

• If you forget how the bin counts are determined, place your cursor
over the column title for the bin value. A pop-up comment
box will appear indicating the method used.


Working with Histograms 

Frequency tables are good at conveying specific information about a
distribution, but they often lack visual impact. It’s hard to get a good
impression about how the values are clustered from the counts in the frequency
table. Many statisticians prefer a visual picture of the distribution in the
form of a histogram. A histogram is a bar chart in which each bar represents
a particular bin and the height of the bar is proportional to the number of
counts in that bin. Histograms can be used to display frequencies, cumulative
frequencies, percentages, and cumulative percentages. Most histograms
display the frequency or counts of the observations.


Creating a Histogram 


Excel does not have a chart type for the histogram, but you can create one
using either the Data Analysis ToolPak supplied with Excel or the histogram
command from the StatPlus add-in. Create a histogram of the price data from the
Housing workbook using the StatPlus histogram command.


To create a histogram of the home prices:

1 Click Single Variable Charts from the StatPlus menu and then click
Histograms.

2 Click the Data Values button and select Price from the list of range names.

As with the Frequency Table command, you can specify the number
and type of bins used to construct the bars of the histogram.

3 Click the Chart Options dialog tab.

4 Click the Values button and then click the Use Range References
button. Select the range G1:G20 in the Price Table with Bins worksheet.
Click the OK button.

5 Click the Right option button to control how bin counts are determined.
See Figure 4-6.


Figure 4-6 Specifying bin options for the histogram



6 Click the Input dialog tab to view other options for the histogram. 

7 Click the Output button. 

8 Verify that the As a new chart sheet option button is selected and 
then type Price Histogram in the accompanying text box. 

This will send the histogram to a chart sheet named Price 
Histogram. 

9 Click the OK button. 

Figure 4-7 shows the completed Histogram dialog box. Note that this
command allows you to create histograms of the frequency, cumulative
frequency, percentage, or cumulative percentage. In most cases,
histograms display the frequency of a particular variable.

Figure 4-7 Completed histogram dialog box
[Callout: you can create four different types of histograms]

10 Click the OK button to create the histogram. Figure 4-8 shows the
completed histogram.

Figure 4-8 Histogram of housing prices
[x-axis: Price]

The histogram gives us the strong visual picture that most of the home
prices in this 1993 sample were below $130,000 and that most were in the
$70,000-$100,000 range. There does not seem to be any clustering of values
beyond $130,000; rather, the data values are clustered toward the lower end
of the price scale.

STATPLUS TIPS

• You can also create separate histograms for the different levels
of a categorical variable (or for different variables) by using the
StatPlus > Multivariable Charts > Multiple Histograms command.

• The Histogram command includes a Chart Titles button located
on the Chart Options dialog sheet. By clicking this button, you
can enter titles for the chart, x axis, and y axis. You can also
control some of the appearance of the x axis and y axis.

• The Left option button for the bin intervals in the Histogram command
is equivalent to counting observations that are >= bin value
and < next bin value. The Center option button counts observations
that are centered around the bin value (counting from the lower
mid-point). The Right option button counts observations that
are <= bin value and > previous bin value.

• You can add a table to the output of the Histogram command by
clicking the Table checkbox in the dialog box. This table contains
count values, similar to what you would see in the corresponding
frequency table.


Shapes of Distributions 

The visual picture presented by the histogram is often referred to as the
distribution’s shape. Statisticians classify various distributions on the basis of their
shape. These classifications will become important later on as we look for an
appropriate statistic to summarize the distribution and its values. Some
statistics are appropriate for one distribution shape but not for another.

A distribution is skewed if most of the values are clustered toward either
the left or the right edge of the histogram. If the values are clustered toward
the left edge of the histogram (leaving a longer tail to the right), this shows
positive skewness; clustering toward the right edge of the histogram shows
negative skewness. Skewed distributions often occur where the variable is
constrained to have positive values. In those cases, values may cluster near
zero, but because the variable cannot have a negative value, the distribution
is positively skewed. A distribution is symmetric if the values are clustered
in the middle with no skewness toward either the positive or the negative
side. See Figure 4-9 for examples of these three types of shapes.

Figure 4-9 Distribution shapes
[Three panels: Positive Skewness, Symmetric, Negative Skewness]

Another important component of a distribution’s shape is the distribution’s
tails, the values located at the extreme left or right edge. A distribution
with very extreme observations is said to be a heavy-tailed distribution.

The historic sample of home prices we’ve examined appears to be positively
skewed with a heavy tail (because there are a number of houses located
at the high end of the price scale). This is not surprising, because there
is a practical lower limit for housing prices (around $50,000 in this sample)
and an exceedingly large upper limit.
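The direction of skewness can also be checked numerically. The text does not give a formula, so the sketch below uses one common definition, the moment coefficient of skewness (the third central moment divided by the 1.5 power of the second). Treat the function name and the choice of formula as assumptions; other software may apply a sample-size adjustment and give slightly different values.

```python
def skewness(values):
    """Moment coefficient of skewness: positive means a longer right tail."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5

# Hypothetical data sets chosen to show each shape.
print(skewness([1, 2, 3, 4, 100]))     # clustered left, long right tail
print(skewness([1, 2, 3, 4, 5]))       # symmetric
print(skewness([1, 97, 98, 99, 100]))  # clustered right, long left tail
```

A positive result corresponds to values clustered toward the left edge of the histogram, a negative result to values clustered toward the right edge.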


Breaking a Histogram into Categories 

You can gain a great deal of insight by breaking your histogram into
categories. In the current example, we may be interested in knowing how the 1993
Albuquerque prices compared when broken down by location: Were certain
locations more expensive than others? One of the more desirable locations
in Albuquerque at the time was the northeast sector. Was this reflected in a
histogram of the sample home prices? Let’s find out.

To create a histogram broken down by categories: 

1 Click the Housing Data worksheet tab to return to the price data.

2 Click Single Variable Charts from the StatPlus menu and then click
Histograms. Click the Data Values button, select Price from the list
of range names as the source for the histogram, and click OK.

3 Click the Break down the histogram by categories checkbox.

The various categories can be displayed in a histogram as stacked on
top of each other, side by side, or in three dimensions. You’ll see the
effect of these choices on the histogram’s appearance in a moment.
For now, accept the default, Stack.

4 Click the Categories button, click the Use Range Names option button,
select NE_Sector, and then click the OK button.

The NE_Sector variable is a qualitative variable that is equal to Yes if
the home is located in the northeast sector and is equal to No otherwise.
Now, define the options for the histogram’s bins.

6 Click the Chart Options dialog tab.

7 Click the Values button, click the Use Range References option button,
and then select the range G1:G20 on the Price Table with Bins
worksheet. Click the OK button.

8 Click the Right option button to set how bin values will be counted
in the histogram.

9 Click the Output button and type Price Histogram by NE Sector in
the Chart Sheet name box. Click the OK button.

10 Click the OK button to start creating the histogram. The completed
chart appears in Figure 4-10.


Figure 4-10 The price histogram broken down by the NE Sector variable
[Callouts: Northeast sector homes; homes outside the Northeast sector]



In this histogram, the height of each bar is still equal to the total count of
values within that bin, but each bar is further broken down by the counts for
the various levels of the categorical variable. The counts are stacked on top
of each other. The chart makes it clear that many higher-priced homes are
located in the northeast sector, though there are still plenty of northeast
sector homes in the $70,000-$100,000 range.

How do the shapes of the distributions compare for the two types of
homes? We can’t tell from this chart, because the northeast sector homes
are all stacked at uneven levels. To compare the distribution shapes, we can
compare histograms side by side. We can change the orientation of the
histogram by modifying the chart type employed by Excel.


To compare histograms side by side:

1 Click anywhere within the chart to select it.

2 Click the Design tab from the Chart Tools ribbon.

3 Click the Change Chart Type button in the Type group on the
Design tab.

4 Excel opens the Change Chart Type dialog box. From within the
dialog box click the Column chart type and then click the first subtype,
Clustered Column, in the dialog box (see Figure 4-11).


Figure 4-11 Changing the chart type
[Callouts: stacked chart type; 3-D chart sub-types]


5 Click the OK button. As shown in Figure 4-12, Excel changes the
chart type, displaying the histogram bars side by side rather than
stacked.

Figure 4-12 Histogram bars displayed side by side

This chart shows us that the distribution of home prices is positively
skewed (as we would expect) for both northeast sector and non-northeast
sector homes. The primary difference is that in the northeast sector there
are more homes at the high end. By selecting the appropriate chart subtype,
we can switch back and forth between side by side and stacked views of the
histogram; we can even view the histogram in three dimensions.


Working with Stem and Leaf Plots 

Stem and leaf plots are another way of displaying a distribution, while at
the same time retaining some information about individual values from the
data sample. The stem and leaf plot originally was used by statisticians as
a quick way of generating a plot of a distribution using only pen and paper,
but it still has application even when graphical plots are so readily available.
To create a stem and leaf plot, follow these steps:

1. Sort the data values in ascending order.

2. Truncate all but the first two digits from the values (i.e., change 64,828
to 64,000, change 14,048 to 14,000, and so forth). The first of the two
digits is the stem and the second the leaf. In the case of a number like
64,000, the stem is 6 and the leaf is 4.

3. List the stems in ascending order vertically on a sheet and place a
vertical dividing line to the right of the stems.

4. Match each leaf to its stem, placing the leaf values in ascending order
horizontally to the right of the vertical dividing line.

For example, take the following numbers: 

125, 189, 232, 241, 248, 275, 291, 311, 324, 351, 411, 412, 558, 713

Truncating all but the first two digits from the list leaves us with

120, 180, 230, 240, 240, 270, 290, 310, 320, 350, 410, 410, 550, 710

The stem and leaf pairs are therefore

(12), (18), (23), (24), (24), (27), (29), (31), (32), (35), (41), (41), (55), and (71).

Now, we list just the stems in ascending order vertically as follows:

100X |
   1 |
   2 |
   3 |
   4 |
   5 |
   6 |
   7 |

At the top of the stem list, we’ve included a multiplier, so we know our 
data values go from 100 to 700. Note that we’ve added a stem for the value 6. 
We include this to preserve continuity in the stem list. Now we add a leaf to 
the right of each stem. The first stem and leaf pair is (12), so we add 2 to the 
right of the stem value 1, and so on. The final stem and leaf plot appears, as 
follows: 


100X |
   1 | 28
   2 | 34479
   3 | 125
   4 | 11
   5 | 5
   6 |
   7 | 1


The stem and leaf plot resembles a histogram turned on its side. The plot 
has some advantages over the histogram. From the stem and leaf plot, you 
can generate the approximate values of all the observations in the data set 
by combining each stem with its leaves. Looking at the plot above, you can 
quickly see that the first two stem and leaf pairs are (1.2) and (1.8).
Multiplying these values by 100 yields approximate data values of 120 and 180.
An added advantage is that the stem and leaf plot can be quickly generated 
by hand—useful if you don’t have a computer handy. 
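The four steps above can be sketched in Python as follows. The function name `stem_and_leaf` is hypothetical, and the sketch assumes positive whole numbers that all share the same stem multiplier (100 in the worked example).

```python
def stem_and_leaf(values, multiplier):
    """Map each stem to a string of leaves after truncating to two digits."""
    leaf_unit = multiplier // 10
    # Integer division truncates: 125 -> stem 1, leaf 2.
    pairs = sorted((v // multiplier, (v % multiplier) // leaf_unit)
                   for v in values)
    # Include every stem between the smallest and largest to keep continuity.
    plot = {stem: "" for stem in range(pairs[0][0], pairs[-1][0] + 1)}
    for stem, leaf in pairs:
        plot[stem] += str(leaf)
    return plot

data = [125, 189, 232, 241, 248, 275, 291, 311, 324, 351, 411, 412, 558, 713]
print("100X |")
for stem, leaves in stem_and_leaf(data, 100).items():
    print(f"{stem:>4} | {leaves}")
```

Run on the fourteen numbers from the example, the sketch produces the same stems and leaves as the plot shown above, including the empty stem 6.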
This stem and leaf plot is at a disadvantage compared to the histogram
in that the size of each bin is directly determined by the data values
themselves. Stem and leaf plots also don’t work well for large data sets where
each stem will need to display a large number of leaves. One way of modifying
a stem and leaf plot is to split the stems into subgroups. For example,
you can split a stem into two groups: those with leaves having values from
0 to 4 and those with leaves from 5 to 9. Doing this for the above chart yields
the following stem and leaf plot:


100X |
   1 | 2
   1 | 8
   2 | 344
   2 | 79
   3 | 12
   3 | 5
   4 | 11
   4 |
   5 |
   5 | 5
   6 |
   6 |
   7 | 1
   7 |


Another modification to the stem and leaf plot is to truncate lower and
upper values in order to reduce the range of stems in the plot. This is
useful in situations where you have an extreme value whose presence would
greatly elongate the plot’s appearance. For example, if the value 2,420 is
added to the above data set, then the resulting stem and leaf plot will have
a long run of stems with no leaves. In this case, removing this value
from the stem and leaf plot, but noting its value elsewhere, might be the best
course of action. The plot might look as follows:


100X |
   1 | 28
   2 | 34479
   3 | 125
   4 | 11
   5 | 5
   6 |
   7 | 1

2400


Excel does not have a command to create stem and leaf plots, but you can
create one using StatPlus. Let’s create a stem and leaf plot for the home price
data and compare it to the histogram we created earlier. As before, we’ll break
the stem and leaf plot down using the values of the NE_Sector variable.

To create a stem and leaf plot:

1 Return to the data set by clicking the Housing Data worksheet tab.

2 Click Single Variable Charts from the StatPlus menu and then click
Stem and Leaf.

This command allows you to create plots of variables located in different
columns or within a single column, broken down by category
levels. You’ll do the latter in this case.

3 Verify that the Use column of category levels option button is selected
and then click the Data Values button and select Price from
the list of range names. Click OK.

4 Click the Categories button and select NE_Sector from the list of
range names. Click OK.

5 Click the Apply uniform stem values checkbox. This will apply the same
stem values to home prices both in the northeast sector and elsewhere.

6 Click the Add a summary plot checkbox. This will create a stem
and leaf plot of prices for all of the homes, regardless of location.

7 Click the Output button, click the New Worksheet option button, and type
Price StemLeaf in the New Worksheet name box. Click the OK button.

Figure 4-13 shows the completed Stem and Leaf dialog box.


Figure 4-13 The completed stem and leaf dialog box

8 Click the OK button. Excel generates the stem and leaf plot shown
in Figure 4-14.

Figure 4-14 Stem and leaf plot of the housing data
[Callouts: stem multiplier, stem values, leaf values]
In this plot, the stem values occupy the first column and the leaf values
are placed in the following three columns for homes outside the northeast
sector, in the northeast sector, and over all sectors. Note that cell A1
identifies the stem multiplier, indicating that each stem value must be multiplied
by 10,000 in order to calculate the underlying data values.

Let’s see how this works. The first stem value is 5; this represents 50,000.
The first leaf value is 4 (where the NE_Sector variable equals No), which
would represent a value one decimal place lower, or 4,000. Thus the first
data value in this plot equals the stem value plus the leaf value, or 54,000,
which is equal to the value of the lowest-priced home in the sample. Using
the same method, you can calculate the value of the highest-priced home
to be $215,000. You can also see at a glance that there are no homes in the
$170,000-$179,000 price range, though there is one home priced at about
$169,000 (actually $169,500). In addition to this information, you can also
use the stem and leaf plot to make the same observations about the shape of
the distribution that you did earlier with the histogram.



Distribution Statistics 


You should always create a chart of the distribution when analyzing a data
set, but once you’ve done that, you’ll probably look for statistics that
summarize key elements of the distribution. These values are sometimes called
landmark summaries because they are used as landmarks, comparing
individual values to whole populations, or whole populations to each other.


Percentiles and Quartiles 


One of these landmark summaries is the pth percentile, which is a value
such that roughly p% of the data are smaller than that value. You may have
seen percentiles used in growth statistics, where the progress of a newborn
child will place him or her in the 75th percentile or 90th percentile, meaning
that the child’s weight is equal to or above 75 or 90% of the population. In
the Albuquerque data, percentiles could be used as a benchmark to compare
one community of that era to another. If you knew the 10th and 90th
percentiles for home price, you would have a basis for comparison between the two
communities.

Perhaps the most important percentiles are the quartiles, which are the 
values located at the 25th, 50th, and 75th percentiles (the quarters). These 
are commonly referred to as the first, second, and third quartiles.
Statisticians are also interested in the interquartile range, which is the difference
between the first and third quartiles. Because the central 50% of the data 
lie within the interquartile range, the size of this value gives statisticians an 
idea of the width of the distribution. 
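As an illustration of how quartiles and the interquartile range are obtained from sorted data, here is a small Python sketch. It assumes linear interpolation at position k(n - 1) in the sorted list, which is one of several conventions for percentiles; the function name `percentile` is hypothetical and the result may differ slightly from software that uses another convention.

```python
def percentile(values, k):
    """Interpolated percentile of `values`, where 0 <= k <= 1."""
    data = sorted(values)
    pos = k * (len(data) - 1)  # fractional position in the sorted list
    lower = int(pos)
    frac = pos - lower
    if lower + 1 < len(data):
        return data[lower] + frac * (data[lower + 1] - data[lower])
    return data[lower]

# Hypothetical data set for illustration.
data = [125, 189, 232, 241, 248, 275, 291, 311, 324, 351, 411, 412, 558, 713]
q1 = percentile(data, 0.25)
q2 = percentile(data, 0.50)
q3 = percentile(data, 0.75)
print(q1, q2, q3, "IQR =", q3 - q1)
```

The central 50% of these values lies between the first and third quartiles, so the difference between them gives the width of the middle of the distribution.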

One way of calculating the percentiles and quartiles of a given distribution 
is to create a frequency table like the one shown earlier in Figure 4-2. From the 
column of cumulative percents, you can determine which values correspond 
to the 10th, 25th, 50th, 75th, 90th, and other percentiles. However, if
your data set is large, this can be a cumbersome and time-consuming process.
To save time, Excel has several functions that will calculate these values for
you. A list of these functions is shown in Table 4-4.


Table 4-4 Excel functions to calculate percentiles and quartiles

Function                             Description

PERCENTILE(array, k)                 Returns the kth percentile of an array of values or
                                     range reference, where k is a value between 0 and 1.

PERCENTRANK(array, x, significance)  Returns the percentile of a value taken from an array
                                     of values or range reference. The number of digits is
                                     determined by the significance parameter.

QUARTILE(array, quart)               Returns the quartile of an array of values or range
                                     reference, where quart is either 1, 2, or 3 for the first,
                                     second, or third quartile.

IQR(array)                           Calculates the interquartile range of the values in an
                                     array or range reference. StatPlus required.


Excel allows you to work with percentiles in two different ways. You can
use the PERCENTILE function to take a percentile and determine the
corresponding data value, or, given the data value, you can use the PERCENTRANK
function to determine its percentile.

You can create a table of percentile and quartile values by typing in
the above Excel formulas, or you can have StatPlus do it for you with the
Univariate Statistics command. The Univariate Statistics command also
allows you to break down the variable into different levels of a categorical
variable.

In this example you’ll limit yourself to percentiles and quartiles. Create 
such a table now of the housing prices broken down by location. 

To create a table of percentile and quartile values:

1 Click Descriptive Statistics from the StatPlus menu and then click
Univariate Statistics.

2 Click the Input button and select Price from the list of range names.

3 Click the Output button, click the New Worksheet option button,
and type Percentiles in the New Worksheet box. Click the OK
button.

4 Click the By button and select NE_Sector from the list of range
names.

5 Click the Distribution dialog tab.

6 Click each of the checkboxes for the different percentiles. See
Figure 4-15.

Figure 4-15 The Univariate Statistics dialog box

7 Click the OK button to create the table of percentiles. Figure 4-16
shows the completed table.

Figure 4-16 Table of percentiles for the price variable

The table of percentiles gives us some additional information about
the housing prices. The values are pretty close between the two locations
up to the 50th percentile, after which large differences begin to appear.
It’s particularly striking to note that the 90th percentile for home prices
outside the northeast sector was $130,200, whereas for northeast sector homes
it was $172,650, more than $40,000 higher. We noted earlier that there are
more high-priced homes in the northeast sector.

EXCEL TIPS

• You can also get a table of cumulative percents using the Rank
and Percentile command in the Data Analysis ToolPak, an add-in
packaged with Excel.

• You can also use the Data Analysis ToolPak to create a table of
descriptive statistics.


Measures of the Center: Means, Medians, and the Mode 

Another way to summarize a data set would be to calculate a statistic that
summarizes the contents in a single value that we would think of as the
typical or most representative value. The table of percentiles suggests one
such value: the 50th percentile, or median. Because the median is located at
the 50th percentile, it represents the middle of the distribution: Half of the
values are less than the median, and half are greater than the median. Based
on the results from Figure 4-16, the median house price in the Albuquerque
sample from 1993 was $94,000 for non-northeast sector homes, $98,000 for
northeast sector homes, and $96,000 overall.

The exact calculation of the median depends on the number of observations
in the data set. If there is an odd number of values, the median is the
middle value, but if there is an even number of values, the median is equal
to the sum of the two central values divided by 2.
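The odd and even cases described above can be written out directly. This is a minimal Python sketch with a hypothetical function name and made-up values.

```python
def median(values):
    """Middle value for odd n; average of the two central values for even n."""
    data = sorted(values)
    n = len(data)
    mid = n // 2
    if n % 2 == 1:
        return data[mid]
    return (data[mid - 1] + data[mid]) / 2

print(median([3, 1, 2]))      # odd number of values: the middle value
print(median([4, 1, 3, 2]))   # even number: average of 2 and 3
```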

Another commonly used summary measure is the average. The average,
or mean, is equal to the sum of the values divided by the number of
observations. This value is usually represented by the symbol $\bar{x}$ (pronounced
“x-bar”), a convention we’ll repeat throughout the course of this book.
Expressed as a formula, this is

\[
\bar{x} = \frac{\text{Sum of values}}{\text{Number of observations}}
        = \frac{x_1 + x_2 + \cdots + x_n}{n}
        = \frac{\sum_{i=1}^{n} x_i}{n}
\]

The total number of observations in the sample is represented by the
symbol $n$, and each individual value is represented by $x$ followed by a
subscript. The first value is $x_1$, the second value is $x_2$, and so forth, up to the last
value (the $n$th value), which is represented by $x_n$. The formula calls for us
to sum all of these values, an operation represented by the Greek symbol $\Sigma$
(pronounced “sigma”), a summation symbol. In this case, we’re instructed
to sum the values of $x_i$, where $i$ changes in value from 1 up to $n$; in other
words, the formula tells us to calculate the value of $x_1 + x_2 + \cdots + x_n$. The
average, or mean, is equal to this expression divided by the total number of
observations.

How do these two measures, the median and the mean, compare? One weakness
of the mean is that it can be influenced by extreme values. Figure 4-17
shows a distribution of professional baseball salaries. Note that most of the
salaries are less than $1 million per year, but there are a couple of players
who make more than $20 million per year. What, then, is a typical salary?
The median value for this distribution is about $3,500,000, but the mean
salary is almost $4,700,000. The median seems more representative of what the
typical player makes, whereas the mean salary is higher as a result of the
influence of a couple of much larger salaries. If you were a union representative
negotiating a new contract, which figure would you quote? If you
represented management, which value better reflects your expenses in salaries?


Figure 4-17 Distribution of baseball salaries
[x-axis: Baseball Salaries]

The lesson from this example is that you should not blindly accept any
single summary measure. The mean is sensitive to extreme values; the median
overcomes this problem by ignoring the magnitude of the upper and
lower values. Both approaches have their limitations, and the best approach
is to examine the data, create a histogram or stem and leaf plot of the
distribution, and thoroughly understand your data before attempting to
summarize it. Even then, it may be best to include several summary measures to
compare.
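The sensitivity of the mean to extreme values is easy to demonstrate with a few made-up numbers. The salaries below are hypothetical and are not taken from the baseball data; the sketch uses the `mean` and `median` functions from Python’s standard `statistics` module.

```python
from statistics import mean, median

# Hypothetical salaries: five modest values and one very large one.
salaries = [400_000, 450_000, 500_000, 550_000, 600_000, 20_000_000]

print("mean:  ", mean(salaries))
print("median:", median(salaries))

# Removing the single extreme value changes the mean far more than the median.
without_top = salaries[:-1]
print("mean without top salary:  ", mean(without_top))
print("median without top salary:", median(without_top))
```

In this made-up sample one large salary pulls the mean to several times the median, while dropping it moves the median only slightly.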

The mean and median are the most common summary statistics, but there 
are others. Let’s examine those now. 

One method of reducing the effect of extreme values on the mean is to
calculate the trimmed mean. The trimmed mean is the mean of the data values
calculated after excluding a percentage of the values from the lower and
upper tails of the distribution. For example, the 10% trimmed mean would
be equal to the average of the middle 90% of the data after exclusion of
values from the lower and upper 5% of the range. The trimmed mean can be
thought of as a compromise between the mean and the median.
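
The idea can be sketched in a few lines of Python. This is a minimal sketch, not Excel's implementation: the function name `trimmed_mean`, the sample data, and the choice to round the number of trimmed points per tail down to a whole number are all assumptions made for illustration.

```python
import math
from statistics import mean, median

def trimmed_mean(values, percent):
    """Mean of the values left after trimming `percent` of the points,
    half from the lower tail and half from the upper tail."""
    ordered = sorted(values)
    per_tail = math.floor(len(ordered) * percent / 2)  # points dropped per tail
    kept = ordered[per_tail:len(ordered) - per_tail]
    return mean(kept)

# Hypothetical data with one very large value.
data = [1, 2, 3, 4, 5, 6, 7, 8, 20, 100]

print(mean(data), trimmed_mean(data, 0.2), median(data))
```

For this sample the 20% trimmed mean (6.875) falls between the median (5.5) and the mean (15.6), which is the compromise the text describes.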

Another commonly used measure of the center is the geometric mean.
The geometric mean is the nth root of the product of the data values.

$$\text{Geometric mean} = \sqrt[n]{x_1 \cdot x_2 \cdots x_n}$$

Once again, the symbols $x_1$ to $x_n$ represent the individual data values from
a data set with n observations. The geometric mean is most often used when
the data come in the form of ratios or percentages. Certain drug experiments
are recorded as percentage changes in chemical levels relative to a baseline
value, and those values are best summarized by the geometric mean. The
geometric mean can also be used in situations where the distribution of the
values is highly skewed in the positive or negative direction. The geometric
mean cannot be used if any of the data values are negative or zero.
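
A minimal Python sketch of the calculation follows. The function name and the sample ratios are hypothetical. The sketch averages logarithms, which is mathematically equivalent to taking the nth root of the product, and it rejects zero or negative values in line with the restriction above.

```python
import math

def geometric_mean(values):
    """nth root of the product of strictly positive values."""
    if any(v <= 0 for v in values):
        raise ValueError("geometric mean requires strictly positive values")
    # exp(mean of logs) equals the nth root of the product
    return math.exp(sum(math.log(v) for v in values) / len(values))

ratios = [2.0, 8.0]  # hypothetical ratios relative to a baseline
print(geometric_mean(ratios), sum(ratios) / len(ratios))
```

Here the geometric mean is about 4 (the square root of 16), while the ordinary average is 5.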

Another measure, not widely used today (though the ancient Greeks used it
extensively), is the harmonic mean. The formula for the harmonic mean H is

$$\frac{1}{H} = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{x_i}$$

The harmonic mean can be used to calculate the mean values of rates. For
example, a car traveling at a rate of S miles per hour to a destination and
then at a rate of T miles per hour on the return trip, travels at an average rate
equal to the harmonic mean of S and T.
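
The round-trip claim can be checked numerically. In this sketch the speeds (30 and 60 miles per hour) and the 120-mile distance are made-up values, and `harmonic_mean` is a small helper written for the illustration.

```python
def harmonic_mean(rates):
    """Reciprocal of the average of the reciprocals."""
    return len(rates) / sum(1 / r for r in rates)

# Hypothetical trip: 120 miles out at 30 mph, 120 miles back at 60 mph.
distance = 120
s, t = 30, 60
total_time = distance / s + distance / t   # 4 hours out + 2 hours back
average_rate = 2 * distance / total_time   # total distance / total time

print(average_rate, harmonic_mean([s, t]))
```

Both calculations give about 40 miles per hour, not the 45 that a simple average of the two speeds would suggest.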

Our final measure of the center is the mode. The mode is the most
frequently occurring value in a distribution. The mode is most often used when
we are working with qualitative data or discrete quantitative data, basically
any data in which there are a limited number of possible values. The mode
is not as useful in continuous quantitative data, because if the data are truly
continuous, we would expect few, if any, repeat values.
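
The contrast between discrete and continuous data can be seen in a short Python sketch. Both data sets are hypothetical, and the standard library functions `mode` and `multimode` stand in for Excel's MODE function.

```python
from statistics import mode, multimode

# Hypothetical discrete data: number of bedrooms in eight homes.
bedrooms = [3, 2, 3, 4, 3, 2, 4, 3]
print(mode(bedrooms))      # the single most frequent value

# Hypothetical continuous data: no value repeats, so every value ties.
prices = [101.2, 99.8, 100.5, 102.1]
print(multimode(prices))   # every value is returned as a "mode"
```

For the bedroom counts the mode is a useful summary; for the prices every value occurs once, so the mode tells us nothing.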

Table 4-5 displays the Excel functions used to calculate the various
measures of the distribution’s center.

Table 4-5 Excel functions to calculate the distribution’s center

Function                    Description

AVERAGE(array)              Returns the average or mean of the values
                            in an array or data range.

GEOMEAN(array)              Returns the geometric mean of the values
                            in an array or data range.

HARMEAN(array)              Returns the harmonic mean of the values
                            in an array or data range.

MEDIAN(array)               Returns the median of the values in an
                            array or data range.

MODE(array)                 Returns the most frequently occurring
                            value in an array or data range.

TRIMMEAN(array, percent)    Returns the trimmed mean of the values
                            in an array or data range, excluding the
                            lower and upper values, where percent is
                            the fractional number of data points to
                            exclude. The function rounds the number
                            of excluded data points down to the
                            nearest multiple of 2. If percent = 0.3
                            and array contains 30 data points, 30
                            percent of 30 equals 9 and thus 8 points
                            are excluded: four from the upper range
                            and four from the lower range.


Now that you’ve learned a little about these functions, use the Univariate
Statistics command from the StatPlus add-in to generate a table of their
values.


To create a table of mean and median values:

1 Click Descriptive Statistics from the StatPlus menu and then click
Univariate Statistics.

2 Click the Input button and select Price from the list of range names.

3 Click the Output button, click the New Worksheet option button,
and type Means in the New Worksheet box. Click the OK button.

4 Click the By button and select NE_Sector from the list of range
names.

5 Click the Summary dialog tab.

6 Click the Show all summary statistics checkbox. Figure 4-18 shows
the completed dialog box.

Figure 4-18 Selecting summary statistics for the Price variable

7 Click the OK button to create the table of values (see Figure 4-19).

Figure 4-19 Summary statistics for the Price variable



As we would expect, the average price of a home in Albuquerque from
our historic sample is higher than the median value. This effect is more
noticeable in the northeast sector homes because of the group of high-priced
homes in that location. The mean home price was almost $12,000 greater
than the median due to the positive skewness in the data.

Measures of Variability


The mean and median do not tell the whole story about a distribution. It’s also
important to take into account the variability of the data. Variability is a measure
of how much data values differ from one another, or equivalently, how widely
the data values are spread out around the center. Consider the pair of histograms
shown in Figure 4-20. The mean and median are the same for both distributions,
but the variability of the data is much greater for the second figure.


Figure 4-20 Distributions with low and high variability




The simplest measure of variability is the range, which is the difference
between the maximum value in the distribution and the minimum value.
A large variability usually results in a large range of values. However,
the range can be a poor and misleading measure of variability. As shown in
Figure 4-21, two distributions can have the same range but be very different
in the variability of their data.

Figure 4-21 Distributions with different variability but the same range
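
To make the point concrete, the sketch below compares two hypothetical samples that share the same minimum and maximum. The data and the helper `value_range` are invented for the illustration; `stdev` is the standard deviation, a measure of variability discussed later in this section.

```python
from statistics import stdev

# Two hypothetical samples with the same minimum (0) and maximum (10).
clustered = [0, 5, 5, 5, 5, 5, 10]      # most values sit at the center
spread_out = [0, 0, 0, 5, 10, 10, 10]   # most values sit at the extremes

def value_range(values):
    """Difference between the largest and smallest value."""
    return max(values) - min(values)

for name, sample in (("clustered", clustered), ("spread out", spread_out)):
    print(name, value_range(sample), round(stdev(sample), 2))
```

Both samples have a range of 10, yet the second is clearly more variable, which the range alone cannot show.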



The most common measure of variability depends on the deviation of
each data value from the sample average. For each data value $x_i$, calculate
the deviation $d_i$, which is the difference between the sample value and the
sample average, or

$$d_i = x_i - \bar{x}$$

Some of these deviations will be negative (where the data value is less than
the mean), and some will be positive, so we cannot simply take the average
of the deviations because the positive and negative values would cancel each
other out. In fact, the sum of the deviations between each sample value and
the sample mean equals zero, so the average deviation is also zero.
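
A quick numerical check of this cancellation, using a small hypothetical sample:

```python
from statistics import mean

data = [2, 4, 4, 5, 10]                  # hypothetical sample
x_bar = mean(data)                       # sample average
deviations = [x - x_bar for x in data]   # each value minus the average

print(deviations, sum(deviations))
```

The negative deviations (-3, -1, -1) exactly offset the positive one (5), so the sum is zero.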

Instead of averaging the deviations, we’ll square each deviation (to make
it positive) and then sum those values and divide by the number of
observations minus 1. This value, known as the variance, is represented by $s^2$. The
formula for calculating $s^2$ is

$$s^2 = \frac{\text{Sum of squared deviations}}{\text{Number of observations} - 1} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

One measure of variability, the standard deviation (represented by the
symbol s), is calculated by taking the square root of the variance. The complete
formula for the standard deviation s is

$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$

Why do we divide the total of the squared deviations by n - 1, rather than n? 
Recall that the sum of the deviations is known to be zero, so given the first 
n -1 deviations, we can always calculate the remaining deviation. This means 
only n - 1 of the deviations can vary freely; the last value is constrained by 
the values of the preceding deviations. This figure, n - 1, is known as the 
degrees of freedom and is a value that will become more important in the 
chapters that follow. 
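
The calculation can be traced step by step in Python. The sample is hypothetical; the three intermediate quantities play the roles of Excel's DEVSQ, VAR, and STDEV functions, and the standard library's `variance` and `stdev` are shown only as a cross-check.

```python
import math
from statistics import mean, stdev, variance

data = [2, 4, 4, 5, 10]                  # hypothetical sample
x_bar = mean(data)
squared_devs = [(x - x_bar) ** 2 for x in data]

sum_sq = sum(squared_devs)               # sum of squared deviations
s_squared = sum_sq / (len(data) - 1)     # divide by n - 1, not n
s = math.sqrt(s_squared)                 # standard deviation

print(sum_sq, s_squared, s)
# The standard library uses the same n - 1 divisor.
print(variance(data), stdev(data))
```

For this sample the sum of squared deviations is 36, so the variance is 36 / 4 = 9 and the standard deviation is 3.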

The standard deviation represents the typical deviation of values from the 
average. A large value of s indicates a high degree of variability in the data. 
High is a relative term, and we usually speak about high degrees of variability 
only when comparing one distribution with another. Table 4-6 summarizes 
the different functions supported by Excel to describe the variability of data. 


Table 4-6 Formulas to calculate variability of values in data sets

Function              Description

AVEDEV(array)         Returns the average of the absolute value of the
                      deviations in an array or data range.

DEVSQ(array)          Returns the sum of the squared deviations in an
                      array or data range.

MAX(array)            Returns the maximum value in an array or data range.

MIN(array)            Returns the minimum value in an array or data range.

STDEV(array)          Returns the standard deviation of the values in an
                      array or data range.

VAR(array)            Returns the variance of the values in an array or
                      data range.

RANGEVALUE(array)     Returns the range of the values in an array or
                      range reference. StatPlus required.

Measures of Shape: Skewness and Kurtosis


You’ve seen that different distributions can be characterized by their shape.
For example, a distribution may be skewed positively or negatively or may
be symmetric about its midpoint. These visual judgments we make of a
distribution’s shape also can be quantified with a statistic. One of these is the
skewness statistic. Skewness is a measure of the lack of symmetry in the
distribution of the data values.

A positive skewness value indicates a distribution with values clustered
toward the lower range of values with a long tail extending toward the
upper values’ range. A negative skewness indicates just the opposite, with the
long tail extending toward the values lower in the data range. A skewness of
zero indicates a symmetric distribution.

Another statistic, kurtosis, measures the heaviness of the tails in the
distribution. A positive kurtosis indicates more extreme values than expected
in the distribution. A negative kurtosis indicates fewer extreme values than
expected. Table 4-7 shows the Excel functions used to calculate skewness
and kurtosis.

Table 4-7 Excel functions to calculate skewness and kurtosis

Function        Description

KURT(array)     Returns the kurtosis of the values in an array
                or data range.

SKEW(array)     Returns the skewness of the values in an array
                or data range.
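
For readers who want to see what such statistics look like as formulas, here is a hedged Python sketch. It uses the bias-corrected sample formulas that Excel's documentation gives for SKEW and KURT, as far as we are aware; treat that correspondence as an assumption. The function names and test data are invented for the illustration.

```python
from statistics import mean, stdev

def skewness(values):
    """Sample skewness (needs at least 3 values and nonzero spread)."""
    n, x_bar, s = len(values), mean(values), stdev(values)
    total = sum(((x - x_bar) / s) ** 3 for x in values)
    return n / ((n - 1) * (n - 2)) * total

def kurtosis(values):
    """Sample excess kurtosis (needs at least 4 values and nonzero spread)."""
    n, x_bar, s = len(values), mean(values), stdev(values)
    total = sum(((x - x_bar) / s) ** 4 for x in values)
    lead = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return lead * total - correction

symmetric = [1, 2, 3, 4, 5]
long_right_tail = [1, 1, 2, 2, 3, 10]
print(skewness(symmetric), skewness(long_right_tail), kurtosis(symmetric))
```

The symmetric sample has a skewness of essentially zero, the sample with a long upper tail has a positive skewness, and the evenly spaced sample has a negative kurtosis because it contains no extreme values.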


Use the Univariate Statistics command from the StatPlus menu to 
calculate the variability and shape statistics for the prices of homes in the 
Albuquerque sample. 



To create a table of variability and shape statistics:

1 Click Descriptive Statistics from the StatPlus menu and then click
Univariate Statistics.

2 Click the Input button and select Price from the list of range names.

3 Click the Output button, click the New Worksheet option button,
and type Price Variances in the New Worksheet box. Click the OK
button.

4 Click the By button and select NE_Sector from the list of range names.

5 Click the Variability dialog tab.

6 Click the Show all variability statistics checkbox. See Figure 4-22.

Figure 4-22 Selecting variability statistics from the Univariate Statistics dialog box

7 Click the OK button to create the table. Figure 4-23 shows the
variability statistics generated from the Univariate Statistics command.

Figure 4-23 Variability statistics



On the basis of the output from Figure 4-23, we note that the variability of
the 1993 Albuquerque home prices was higher in the northeast sector than
outside of it (though it’s interesting to note that the range of home prices
was higher for non-northeast sector homes).

STATPLUS TIPS

• You can select all summary, variability, or distribution statistics 
by clicking the appropriate checkboxes in the General dialog 
sheet of the Univariate Statistics dialog box. 

• The Univariate Statistics command can display the table with 
statistics displayed in rows or in columns. 

• You can add your own custom title to the output from the 
Univariate Statistics command by typing a title in the Table 
Title box in the General dialog sheet. 


Outliers 


As the earlier discussion on means and medians showed, distribution
statistics can be heavily affected by extreme values. It’s difficult to analyze a
data set in which a single observation dominates all of the others, skewing
the results. These values, known as outliers, don’t seem to belong with the
others because they’re too small, too large, or don’t match the properties one
would expect for them. As you’ve seen, a large salary can affect an analysis
of salary values, pushing the average salary value upward. An outlier need
not be an extreme value. If you were to analyze fitness data, the records of
an extremely fit 75-year-old might not be remarkable compared to all of the
values in the distribution, but they might be unusual compared to the values of
others in his or her age group.

Outliers are caused by either mistakes in data entry or an unusual or 
unique situation. A mistake in data entry is easier to deal with: You discover 
and correct the mistake and then redo the analysis. If there is no mistake, 
you have a bigger problem. In that case you have to study the outlier and 
decide whether it really belongs with the other data values. For example, 
in a study of Big Ten universities, we might decide to remove the results 
from Northwestern because that school, unlike the other schools, is a small, 
private institution. In the Albuquerque data, we might remove a high- 
priced home from the sample if that house were a public landmark and thus 
uniquely expensive. 

However, and this point cannot be emphasized too strongly, merely being 
an extreme value is not sufficient grounds to remove an observation. Many 
advances have been made by scientists studying the observations that didn’t 
seem to fit the expected distribution. Extreme values may be a natural part 
of the data (as with some salary structures). By removing those values, you 
are removing an important aspect of the distribution. 

One possible solution to the problem of outliers is to perform two
analyses: one with the outliers and one without. If your conclusions are the same,
you can be confident that the outlier had no effect. If the results are
extremely different, you can report both answers with an explanation of the
differences involved. In any case, you should not remove an observation
without good cause and documentation of what you did and why.

What constitutes an outlier? How large (or small) must a value be before
it can be considered an outlier? One accepted definition depends on
the interquartile range (IQR; recall that the interquartile range is equal to
the difference between the third and first quartiles).

1. If a value is greater than the third quartile plus 1.5 × IQR or less than
the first quartile minus 1.5 × IQR, it’s a moderate outlier.

2. If a value is greater than the third quartile plus 3 × IQR or less than the
first quartile minus 3 × IQR, it’s an extreme outlier.

A diagram displaying the boundaries for moderate and extreme outliers
is shown in Figure 4-24.


Figure 4-24 The range of moderate and extreme outliers

For example, if the first quartile equals 30 and the third quartile equals
80, the interquartile range is 50. Any value above 80 + (1.5 × 50), or 155,
would be considered a moderate outlier. Any value above 80 + (3 × 50), or 230,
would be considered an extreme outlier. The lower ranges for outliers would
be calculated similarly.
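
The same arithmetic, including the lower boundaries, can be sketched in Python. The function names `outlier_fences` and `classify` are hypothetical helpers; the quartile values 30 and 80 are the ones used in the example above.

```python
def outlier_fences(q1, q3):
    """Return (inner_low, inner_high, outer_low, outer_high)."""
    iqr = q3 - q1
    return (q1 - 1.5 * iqr, q3 + 1.5 * iqr, q1 - 3 * iqr, q3 + 3 * iqr)

def classify(value, q1, q3):
    """Label a value using the 1.5 x IQR and 3 x IQR rules."""
    inner_low, inner_high, outer_low, outer_high = outlier_fences(q1, q3)
    if value > outer_high or value < outer_low:
        return "extreme outlier"
    if value > inner_high or value < inner_low:
        return "moderate outlier"
    return "not an outlier"

print(outlier_fences(30, 80))
for v in (100, 160, 240, -50):
    print(v, classify(v, 30, 80))
```

With quartiles of 30 and 80, the lower boundaries work out to 30 - 75 = -45 for moderate outliers and 30 - 150 = -120 for extreme outliers.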

This definition of the outlier plays an important role in constructing one
of the most useful tools of descriptive statistics: the boxplot.


Working with Boxplots 

In this section, we’ll explore one of the more important tools of descriptive
statistics, the boxplot. You’ll learn about boxplots interactively with Excel,
and then you’ll apply what you’ve learned to the Albuquerque price data.

CONCEPT TUTORIALS: Boxplots

The files available with this book contain several instructional workbooks.
The instructional workbooks provide interactive worksheets and macros to
allow you to explore various statistical concepts on your own. The first of
these workbooks that you will examine concerns the boxplot. Open this
workbook now.

To start the Boxplots instructional workbook:

1 Open the file Boxplots, located in the Explore folder.

The workbook opens to the Contents page, describing the workbook.
To the left side of the display area is a column of subject titles. You
can move between subject titles either by clicking an entry in the
column or by clicking the arrow icon located at the top of the page.

2 Click What is a boxplot? from the list of subject titles. The page
shown in Figure 4-25 appears.

Figure 4-25 Initial worksheet from the Boxplots Explore workbook

Boxplots are designed to display in a single chart several of the
important descriptive statistics, including the quartiles of the
distribution as well as the minimum and the maximum. They will
also identify any moderate or extreme outliers (using the definition
supplied above).

3 Click The interquartile from the list of subject titles.

The box part of the boxplot displays the interquartile range of the
distribution, ranging from the first quartile to the third. The median
is shown as a horizontal line within the box. Note that the median
need not be in the center of the box. The box tells you where the
central 50% of the data is located. By observing the placement of
the median within the box, you can also get an indication of how
those values are clustered within that central 50%. A median line
close to the first quartile indicates that a lot of the values are
clustered in the lower range of the distribution.

Figure 4-26 The “box” from the boxplot

4 Click The fences from the list of subject titles.

The inner and outer fences of the boxplot set the boundaries between
standard observations, moderate outliers, and extreme outliers. Note
that the formula for the fences matches the formula for moderate
and extreme outliers discussed in the previous section.

Figure 4-27 Constructing the "fences" of a boxplot

5 Click Outliers from the list of subject titles.

If there are any moderate or extreme outliers in the distribution,
they’re displayed in the boxplot. Moderate outliers are displayed
using a black circle •. Extreme outliers are represented by an open
circle o. With a boxplot you can quickly see the outliers in your
distribution and their severity.

Figure 4-28 Representing outliers from the sample distribution

6 Click The whiskers from the list of subject titles.

The final component of the boxplot is the whiskers. These are lines
that extend from the box to the highest and lowest points that
lie inside the moderate outliers. Thus the lines indicate the smallest
and largest values in the distribution that are not considered
outliers. The length of the whisker lines also gives you a further
indication of the skewness of the distribution.

Figure 4-29 Drawing the box "whiskers"

In the finished boxplot, the inner and outer fences are not shown.
Figure 4-30 shows a typical boxplot.

Figure 4-30 The completed boxplot

Figure 4-31 shows how a boxplot might look for distributions with
positive or negative skewness and for symmetric distributions.


Figure 4-31 Boxplots of different distribution shapes



Now that you’ve learned about the structure of the boxplot, try creating a
few boxplots on your own with sample data.

To create your own boxplot:

1 Click Create your own boxplot from the list of topics in the Boxplots
workbook.

2 Enter the following numbers into the green cells located to the left
of the empty chart: 0, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 6, 7.

As you enter the numbers, the chart is automatically updated to
reflect the new distribution. The final form of the boxplot is shown in
Figure 4-32.

Figure 4-32 Entering data into a sample boxplot

The central 50% of the data is found in the range from 2.5 to 4.5. The
median value is 3, which is not in the middle of that central 50%.

From the plot we can see that the values range from 0 to 7. There
are no outliers in the distribution. Now let’s see what happens if we
change a few of those numbers.

3 Change the last two numbers in the sample data from 6 and 7 to 9
and 12. Figure 4-33 shows the updated chart.

Figure 4-33 Sample boxplot with moderate and extreme outliers

The new sample values appear in the boxplot as moderate and extreme
outliers respectively. From the boxplot we can see that there is a large gap
between the moderate outlier and the largest standard value. Continue to
explore boxplots with the instructional workbook. Try different
combinations of values and different types of distributions.
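
The numbers behind the two sample boxplots can be reproduced with a short Python sketch. The function `boxplot_summary` is a hypothetical helper, and the "inclusive" quartile method is an assumption, chosen because it reproduces the quartiles of 2.5 and 4.5 quoted above; other quartile conventions can give slightly different boundaries.

```python
from statistics import quantiles

def boxplot_summary(values):
    """Quartiles, whisker ends, and outliers for a boxplot (sketch)."""
    ordered = sorted(values)
    q1, med, q3 = quantiles(ordered, n=4, method="inclusive")
    iqr = q3 - q1
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)   # inner fences
    outer = (q1 - 3 * iqr, q3 + 3 * iqr)       # outer fences
    standard = [x for x in ordered if inner[0] <= x <= inner[1]]
    extreme = [x for x in ordered if x < outer[0] or x > outer[1]]
    moderate = [x for x in ordered
                if (x < inner[0] or x > inner[1]) and x not in extreme]
    return {"q1": q1, "median": med, "q3": q3,
            "whiskers": (standard[0], standard[-1]),
            "moderate": moderate, "extreme": extreme}

first = boxplot_summary([0, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 6, 7])
second = boxplot_summary([0, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 9, 12])
print(first)
print(second)
```

In the second sample the quartiles are unchanged, so the inner fence sits at 4.5 + 3 = 7.5 and the outer fence at 4.5 + 6 = 10.5. That makes 9 a moderate outlier and 12 an extreme outlier, and the upper whisker stops at 5.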

When you’re finished with the Boxplots workbook: 

1 Close the Boxplots workbook. You do not have to save any of your 
changes. 

2 Return to the Housing Statistics workbook in Excel. 


Excel does not contain any commands to create boxplots, but you can
create one using StatPlus. The StatPlus command includes the added feature of
displaying a dotted line representing the average value for the distribution.
Try creating a boxplot comparing the NE sector from the Albuquerque data
to other homes in the sample.

To create a boxplot of the price data:

1 Click Single Variable Charts from the StatPlus menu and then click
Boxplots.

The Boxplots command allows you to create boxplots on the basis of
values in separate columns or within a single column broken down
by the levels of a categorical variable. In this case you’ll use a single
column, Price, and a categorical variable, NE_Sector.

2 Verify that the Use column of category levels option button is
selected.

3 Click the Data Values button and choose Price from the list of range
names.

4 Click the Categories button and choose NE_Sector from the list of
range names.

5 Click the Output button, click the As a new chart sheet option
button, and type Price Boxplot in the adjacent name box. Click the OK
button. Figure 4-34 shows the completed dialog box.

Figure 4-34 The completed Boxplots dialog box

6 Click the OK button. The output from the Boxplot command is
shown in Figure 4-35.

Figure 4-35 Boxplot of 1993 housing prices in the Albuquerque, New Mexico area


The boxplot gives us yet another visual picture of our data. Note that
the three extreme prices in the non-northeast sector homes are all
considered outliers and that the range of the other values for that area extends
from about $50,000 to around $140,000. Were those homes overpriced for
their area? There may be something unusual about those three homes that
would require further research. The range of values for the northeast homes
is clearly much wider, and only homes above $200,000 are considered
moderate outliers. The boxplot also gives us a visual picture of the difference
between the mean and median for the northeast homes. This information
would certainly caution us against blindly using only the mean to
summarize our results.

STATPLUS TIPS

• To specify the chart’s title or a title for the x axis or y axis, click 
the Chart Options button in the Boxplot dialog box. 


You’ve completed your work on the Albuquerque data. You can close the
workbook and exit Excel now.

Exercises


1. Define the following terms:

a. Quantitative variable

b. Qualitative variable

c. Continuous variable

d. Ordinal variable

e. Nominal variable

2. What is a skewed distribution? What is positive skewness? What is
negative skewness?

3. What is a stem and leaf plot? What are the advantages of stem and
leaf plots over histograms? What are some of the disadvantages?

4. What is the interquartile range?

5. True or false (and why): Distributions with the same range have
approximately the same variability.

6. What are outliers? What is considered a moderate outlier? What is
considered an extreme outlier?

7. True or false (and why): Outliers should be removed from a data set
before calculating statistics on that data set.

8. What is a boxplot? What are the advantages of boxplots over
histograms? What are some of the disadvantages?

9. You see the following stem and leaf plot in a technical journal:

Stem × 100 | Leaf
         0 | 336
         1 | 01228
         2 | 00111249
         3 | 04
         4 | 5
         5 |
         6 | 1
         7 |
         8 |
         9 | 0

a. What are the approximate values of the data set?

b. Is the distribution positively skewed, negatively skewed, or
symmetric?

c. Give approximate values for the mean and the median.

d. What values, if any, appear to be moderate and extreme outliers in
the distribution?

10. A data distribution has a median value of 22, a first-quartile
value of 20, and a third-quartile value of 30. Five observations lie
outside the interval from the first to the third quartile, with values
of 17, 18, 40, 50, and 75.

a. Draw the boxplot for this distribution.

b. Is the skewness positive, negative, or zero?

11. You’re asked to do further research on the housing market in
Albuquerque, New Mexico, during the early 1990s. In this analysis
you’ll examine the size of the homes sold on the market and the price
per square foot of each home.

a. Open the Housing workbook from the Chapter04 folder and save it as
Home Sizes.

b. Create a table of univariate statistics for the size of the homes
in square feet, including all distribution, variability, and summary
statistics except the mode. Place the table on a worksheet named
Sq Ft. Stats.

c. What are the smallest and largest houses in the sample?

d. If you were interested only in houses that were 2,200 square feet
or higher, what percentage of the houses in the sample would meet the
requirement? (Hint: Use the PERCENTRANK function.)

e. Create a boxplot of the size of the homes in square feet. Place the
boxplot in a chart sheet named Sq Ft. Boxplot.

f. What value appears to be an extreme outlier in the boxplot?

g. Create a second boxplot of house size, this time breaking the
boxplot down by whether the home is a corner lot. Place the plot in a
chart sheet named Sq Ft. Boxplot by Corner Lot.

h. Interpret your boxplot in terms of the relationship between size of
a house and whether it lies on a corner lot.

i. What happened to the extreme outlier you identified earlier?
Discuss this in terms of the definition of outlier given in the text.
Is this value an outlier or not?

j. Recreate the table of univariate statistics for house size, and
this time break the table down by the Corner Lot variable. Place the
table in a worksheet named Sq Ft. Stats by Corner Lot.

k. Create a new column containing the price per square foot of each
house. Assign the values in the new column a range name. What type of
variable is this?

l. Create a histogram with 20 evenly spaced bins of the price per
square foot on a chart sheet named PPSqFt Histogram. Count the bin
totals to the right of the cutoff points.

m. What is the shape of the distribution of price per square foot?

n. Create a boxplot of the price per square foot saved to the chart
sheet PPSqFt. Boxplot. Are there any severe outliers? How does the
median value compare to the mean value?

o. Save your workbook and then write a report summarizing your
observations.

12. Data have been recorded on 50 of the largest woman-owned
businesses in Wisconsin. Analyze and report the descriptive statistics
on this data set.

a. Open the Woman-Owned Businesses workbook from the Chapter04 folder
and save it as Woman-Owned Business Statistics.

b. Create a table of the distribution, variability, and summary
statistics except the mode for the Employees variable. Store the table
in a worksheet named Employee Stats.

c. What is the average number of employees for the 50 businesses? What
is the median amount? Which statistic do you think more adequately
describes the size of these businesses? How does the average number of
employees compare to the third quartile?

d. Create a boxplot of employees stored in a chart sheet named
Employee Boxplot. How would you describe this distribution?

e. Create a new variable containing the base 10 log of the Employees
variable. Assign a range name to this new column and then create a
boxplot of these values in a chart sheet named Log Employee Boxplot.
How does the shape of this distribution compare to the untransformed
values?

f. Create a table of descriptive statistics except the mode for the
log(Employees) and store the table in a worksheet named Log Employee
Stats. Compare the skewness and kurtosis values between the Employees
and log(Employees) variables. Explain how the difference in the
distribution shapes is reflected in these two statistics.

g. Calculate the mean log(Employees) value in terms of the number of
people employed (in other words, transform this value back to the
original scale). How does this value compare to the geometric mean of
the number of employees in each company?

h. The geometric mean is used for values that either are ratios or are
best compared as ratios. Which pair of companies is more similar in
terms of size: a company totaling $50,000,000 in annual sales and a
company with $10,000,000, or a company with $450,000,000 in sales and
a company with $400,000,000 in sales? What are the differences in
sales between the two sets of companies? What are the ratios? Does the
difference or the ratio better express the similarity of the
companies?

i. Save your workbook and write a report summarizing your analysis.
Explain how transforming the employee values using the logarithm
function affected the distribution of the data.

13. In the late 1980s, the U.S. Congress held 
several joint hearings on discrimination 
in lending practices, particularly in the 
mortgage industry. Refusal rates from 
20 lending institutions were presented 
to the committee. Analyze these rates: 

a. Open the Mortgage workbook from 
the Chapter04 folder and save it as 
Mortgage Refusal Rates. 

b. Create a table of univariate statistics 
for the four data columns. Save the 
table in a worksheet named Refusal 
Statistics. 

c. Create a boxplot of the refusal rates 
for the four data columns stored in a 
chart sheet named Refusal Boxplots. 
Label the chart appropriately. 

d. Save your workbook. Including the descriptive statistics and
boxplot you’ve created, write a report detailing your findings. What
conclusion do you draw from the data? Is there any specific
information that this data sample is lacking? Include a discussion of
potential problems in this data set and how you would go about
remedying them.

14. Average teacher salary, public school
spending per pupil, and ratio of teacher
salary to pupil spending for 1985 have
been stored in an Excel workbook. The
values are broken down by state and area.
You’ve been asked to calculate statistics
on teacher salaries on the basis of the data.

a. Open the Teacher workbook from
the Chapter04 folder and save it as
Teacher Salaries.

b. Create a table of univariate statistics
except the mode for the teacher salaries
broken down by area and overall.
Save the table on the worksheet
Salary Statistics.

c. Create a boxplot of the teacher sala¬ 
ries broken down by area on a chart 
sheet named Salary Boxplots. 

d. Discuss the distribution of the teacher 
salaries for each area. There is an ex¬ 
treme outlier in the west area. Which 
state is this? Discuss why salaries for 
teachers in this state might be so high. 

e. Create a table of univariate statistics
except the mode for the ratio of
teacher salary to spending per pupil
broken down by area and overall.
Save the table on a worksheet named
Salary Pupil Ratio Statistics.

f. Create, on a chart sheet named Salary 
Pupil Ratio Boxplots, a boxplot of the 
ratio values broken down by area. 

g. For the state that was an outlier in
the west area in terms of teacher
salary, check to see if it is also
an outlier in terms of the ratio of
teacher salary to public spending
per pupil. Estimate the percentile of
this state’s salary/pupil ratio within
the west area. How does that compare
to its percentile for teacher’s
salary alone? If the cost of education
per pupil is indicative of the
cost of living in a state, are teachers
in this particular state overpaid or
underpaid relative to other states in
the west area?

h. Save your changes to the workbook
and write a report summarizing your
observations and calculations.

15. You’ve been given an Excel workbook
containing annual salary figures for major
league baseball players (in terms of
hundreds of thousands of dollars) for the
2007 season. Use the workbook to calculate
statistics on the players’ salaries.

a. Open the Baseball workbook from the
Chapter04 folder and save it as
Baseball Salary Statistics.

b. Create a histogram of the players’ salaries
with bin intervals of $1,000,000
ranging from $0 up to $25,000,000.
Have the counts within each bin be >
the bin value and < the next bin value.

c. Create a frequency table of the players’
salaries, using the same bin intervals
and options you used to create
the histogram.

d. Calculate the 10th and 90th percentiles
of the salaries.

e. Using the value for the 90th percentile,
filter the player data to show only
those players who were paid in the
upper 10% of the salary range.

f. What is the average player’s salary?
What is the median player’s salary?
If a player made the average salary, at
what percentile would he be ranked
in the data?

g. Save your changes to the workbook
and write a report summarizing your
calculations.

16. You’ve been asked to compare the
changing nature of baseball salaries. An
Excel workbook has been prepared for
you that contains salaries from the years
1985, 2002, and 2007. Examine these
salaries and prepare a statistical report.

a. Open the Salary Comparison workbook
from the Chapter04 folder
and save it as Salary Comparison
Statistics.

b. Create a histogram of the salary data
broken down by year. Have the histogram
display salaries in $1,000,000
intervals up to $25,000,000 and have
the bars of the histogram display the
percentages, not the frequencies, for
each bar. Display the histogram bars
side by side rather than stacked. Does
the distribution of the salaries appear
the same in the three years?

c. Calculate the count, mean, median,
percentiles, skewness, and kurtosis
for the three years. How does the typical
salary compare within the three
years? What do players in the upper
10% of each year make?

d. Has the distribution of the salaries
become more skewed or less skewed
or remained the same over the years?
Answer this question by examining
the skewness and kurtosis statistics.
You’ve learned that the 2007 data are
just for starters. How could this affect
your conclusion?

e. Save your changes to the workbook
and then write a report summarizing
your calculations.

17. The Cancer workbook contains data
comparing the cigarette use per capita for
each of the 50 states and the District of
Columbia to those states’ rates of bladder
cancer, kidney cancer, lung cancer, and
leukemia per 100,000. Each state was
ranked on the basis of cigarette use with
0 for low rates of cigarette use, 1 for medium,
and 2 for high. Analyze the data
from this workbook.

a. Open the Cancer workbook from the
Chapter04 folder and save it as
Cancer Statistics.

b. Create boxplots of the rates of bladder
cancer, kidney cancer, lung cancer,
and leukemia broken down by cigarette
use. Label the charts and chart
sheets appropriately.

c. Create a table of univariate statistics
for the rate of each illness broken
down by the cigarette use category.

d. Does there appear to be any relationship
between these illnesses and the
level of cigarette use in the states?
Defend your answer with your charts,
statistics, and tables.

e. There is one state with a high level of
cigarette use but a relatively low level
of lung cancer. Identify this state.

f. Save your workbook and then write a
report summarizing your observations
and calculations.

18. The Pollution workbook contains air
quality data collected by the Environmental
Protection Agency (EPA). The
data show the number of unhealthful
days (heavy levels of pollution) per year
for 14 major U.S. cities in the year 1980
and the average number of unhealthy
days per year from 2000 through 2006.
The workbook also contains the ratio of
the 2000-2006 average to the 1980 value
and the difference. A ratio value less
than 1 or a difference value less than 0
indicates an improvement in the air
quality. Looking at the data as a whole,
is there evidence to believe that there
has been improvement in the air quality?
Open this workbook and examine
the data.

a. Open the Pollution workbook from
the Chapter04 folder and save it as
Pollution Boxplots.

b. Calculate the mean and median values
of the ratio and difference variables.

c. Create two boxplots. First create a
boxplot of the ratio variable and then
create another boxplot of the difference
variable. Describe the difference
between the shape of the two distributions.
Is one more susceptible to
extreme values than the other? Why
would this be the case? (Hint: Think
about the number of unhealthy days
in 1980. Which cities are most likely
to show the greatest drop in absolute
numbers?)


d. There is an extreme outlier in the boxplot
of the difference values. Identify
the city corresponding to that extreme
outlier.

e. Copy the air quality data to a new
worksheet without the extreme outlier
you noted in part d. Redo the table of
statistics and boxplots with this new
set of data.

f. What are your conclusions? Have
your conclusions changed without the
presence of the outlier? What effect
did the outlier have on the mean and
median values of the ratio and difference
variables? Are you justified in
removing the outlier from your analysis?
Why or why not?

g. Save your workbook and then write a 
report summarizing your calculations 
and observations. Which variable 
seems to better describe the change in 
air quality: the difference or the ratio? 

19. The Reaction workbook contains reaction
times from the first-round heats of
the 100-meter race at the 1996 Summer
Olympic games. Reaction time is the time
elapsed between the sound of the starter’s
gun and the moment the runner leaves
the starting block. The workbook also
contains the heat number, the order of
finish, and the finish group (1st through
3rd, 4th through 6th, and so forth).

a. Open the Reaction workbook from the
Chapter04 folder and save it as Reaction
Statistics.

b. Calculate univariate descriptive statistics
for the reaction times listed. What
are the average, median, minimum,
and maximum reaction times?

c. Create a boxplot of the reaction times. 
Are there any moderate or extreme 
outliers in the distribution? How 
would you characterize the shape of 
the distribution? 

d. Create a stem and leaf plot of the
reaction times.

e. Reaction times are important for determining
whether a runner has false-started.
If the runner’s reaction time
is less than 0.1 second, a false start
is declared. Where would a reaction
time of 0.1 second fall on your
boxplot: as a typical value, a moderate
outlier, or an extreme outlier? Does
this definition of a false start seem
reasonable given your data?

f. Is there an association between reaction
time and the order of finish?
Calculate descriptive statistics for the
reaction times broken down by order
of finish. Pay particular attention to
the mean and the median.

g. Create a boxplot of the reaction times
broken down by order of finish. Is
there anything in your descriptive
statistics or boxplots to suggest that
reaction time plays a part in how the
runner finishes the race?

h. Save your changes to the workbook
and then write a report summarizing
your observations and calculations.

20. The Labor Force workbook shows the
change in the percentage of women
in the labor force from 19 cities in the
United States from 1968 to 1972. You
can use these data to gauge the growing
presence of women in the labor force
during this time period.

a. Open the Labor Force workbook from
the Chapter04 folder and save it as
Labor Force Statistics.

b. Calculate the difference between the
1968 and 1972 values, storing the calculations
in a new column. Calculate
descriptive statistics for the values in
the Difference column.

c. Calculate the mean of the Difference
value.

d. Create a boxplot of the Difference
value. Are there any outliers present
in the data? Identify which city the
value comes from. What do the data
tell you about the change of the presence
of women in the labor force from
1968 to 1972?

e. Describe the shape of the distribution
of the Difference values. Are the data
positively or negatively skewed or
symmetric? Can you use the mean to
summarize the results from this study?

f. Save your workbook and write a report
summarizing your analysis.

21. In 1970, draft numbers were determined
by lottery. All 366 possible birth dates
were placed in a rotating drum and selected
one by one. The first birth date
drawn received a draft number of 1, and
men born on that date were drafted first;
the second birth date received a draft
number of 2; and so forth. Data from the
draft number lottery can be found in the
Draft workbook.

a. Open the Draft workbook from the
Chapter04 folder and save it as Draft
Statistics.

b. Create a boxplot of the draft numbers
broken down by month. Also create a
table of counts, means, medians, and
standard deviations. Is there any evidence
of a trend in the draft numbers
selected compared to the month?

c. Repeat part b, this time breaking the
numbers down by quarters. Is there
any evidence of a trend between draft
numbers and the year’s quarter?

d. Repeat part b, breaking the draft numbers
by first half of the year versus
second half. Is the typical draft number
selected for the first half of the
year close in value to the draft number
for birthdays from the second half
of the year?

e. Discuss your results. The draft numbers
should have no relationship to the
time of the year. Does this appear to be
the case? What effect does breaking the
numbers down into different units of
time have on your conclusion?

f. Save your workbook and write a report
summarizing your investigation
and observations.

22. Cuckoos are known to lay their eggs in
the nests of other host birds. The host
birds adopt and then later hatch the
eggs. The Eggs workbook contains data
on the lengths of cuckoo eggs found in
the nest of host birds. You’ve been asked
to compare the length of the cuckoo eggs
placed in the different nests.

a. Open the Eggs workbook from the
Chapter04 folder and save it as
Egg Statistics.

b. Create a boxplot of the egg lengths for
the different birds.

c. Calculate descriptive statistics of the
egg lengths.

d. One theory holds that cuckoos lay
their eggs in the nests of a particular
host species and that they mate
within a defined territory. If true, this
would cause a geographic subspecies
of cuckoos to develop and natural
selection would ensure the survival
of cuckoos most fitted to lay eggs that
would be adopted by a particular
host. If cuckoo eggs differed in length
between hosts, this would lend some
weight to that hypothesis. Do the
data indicate a possible difference in
cuckoo egg lengths between hosts?
Explain.

e. Save your changes to the workbook
and write a report summarizing your
observations.


Chapter 5


Probability Distributions



In this chapter you will learn to:
► Work with random variables and probability distributions
► Generate random normal data
► Create a normal probability plot
► Explore the distribution of the sample average
► Apply the Central Limit Theorem

Up to now, you’ve used tools such as frequency tables, descriptive
statistics, and scatter plots to describe and summarize the properties
of your data. Now you’ll learn about probability, which provides
the foundation for understanding and interpreting these statistics.
You’ll also be introduced to statistical inference, which uses summary
statistics to help you reach conclusions about your data.


Probability 

Much of science and mathematics is concerned with prediction. Some of
these predictions can be made with great precision. Drop an object, and the
laws of physics can predict how long the object will take to fall. Mix two
chemicals, and the laws of chemistry can predict the properties of the resulting
mixture. Other predictions can be made only in a general way. Flip
a coin, and you can predict that either a head or a tail will result, but you
cannot predict which one. That doesn’t mean that you can’t say anything. If
you flip the coin many times, you’ll notice that roughly half the flips result
in heads and half result in tails.

Flipping a coin is an example of a random phenomenon, in which
individual outcomes are uncertain but follow a general pattern of
occurrences.

When we study random phenomena, our goal is to quantify that general
pattern of occurrences in order to make general predictions. How do we do
this? One way is through theory. We imagine an ideal coin with two sides:
a head and a tail. Because this is an ideal coin, we assume that each side
is equally likely to occur during our coin flip. From this, we can define the
theoretical probability for equally likely events:

$$\text{Theoretical probability} = \frac{\text{Number of possible ways of obtaining the event}}{\text{Total number of possible outcomes}}$$

In the coin-tossing example, there is one way to obtain a head and there 
are two possible outcomes, so the theoretical probability of obtaining a head 
is 1/2, or .5. 

Another way of quantifying random phenomena is through observation.
For example, to determine the probability of obtaining a head, we repeatedly
toss the coin. From our observations, we calculate the relative frequency of
tosses that result in heads, where

$$\text{Relative frequency} = \frac{\text{Number of times an event occurs}}{\text{Number of replications}}$$








Figure 5-1 shows a chart of the results of tossing such a coin 5,000 times. 


Figure 5-1 The relative frequency of tossing a head

Early in the experiment, the relative frequency of heads jumps around
quite a bit, hovering above .5. As the number of tosses increases, the relative
frequency narrows to around the .5 level. The law of large numbers states
that as the number of replications increases, the relative frequency will approach
the probability of the event, or, to put it another way, we can define
the probability of an event as the value approached by the relative frequency
after an indefinitely long series of trials.
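
The coin-tossing experiment can also be imitated outside Excel. The following is a minimal Python sketch, not part of the book's workbooks: it assumes an ideal coin with a .5 probability of heads, and the function name `running_relative_frequency` and the fixed seed are illustrative choices only. It records the relative frequency of heads after every toss so that the settling described by the law of large numbers can be inspected.

```python
import random


def running_relative_frequency(n_tosses, seed=0):
    """Toss an ideal coin n_tosses times and return the relative
    frequency of heads observed after each toss."""
    rng = random.Random(seed)  # fixed seed so the run can be repeated
    heads = 0
    freqs = []
    for toss in range(1, n_tosses + 1):
        heads += rng.random() < 0.5  # True counts as 1 head
        freqs.append(heads / toss)   # relative frequency so far
    return freqs


freqs = running_relative_frequency(5000)
for n in (10, 100, 1000, 5000):
    print(f"after {n:>4} tosses: relative frequency of heads = {freqs[n - 1]:.3f}")
```

With a different seed the early values will jump around differently, but the later values should still settle near .5.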


Probability Distributions 

The pattern of probabilities for a set of events is called a probability distribution.
Probability distributions contain two important elements.

1. The probability of each event or combination of events must range from
0 to 1.

2. The total probability is 1.

In the coin-tossing example, there are two outcomes (head or tail), each
with a probability of .5. The sum of the probabilities of both events is 1, so
this is an example of a probability distribution. Probability distributions can
be classified into two general categories: discrete and continuous.


Discrete Probability Distributions


In a discrete probability distribution, the probabilities are associated with
a series of discrete outcomes. The probabilities associated with tossing a
coin form a discrete distribution since there are two separate and distinct
outcomes. If you toss a 6-sided die, the probabilities associated with that outcome
also form a discrete distribution, where each side has a 1/6 probability of
turning up. We can write this as

$$p(y) = \frac{1}{6}; \quad y = 1, 2, 3, 4, 5, 6$$

where p(y) means the “probability of y,” for integer values of y ranging from
1 to 6.

Note that discrete does not mean “finite.” There are discrete probability
distributions that cover an infinite number of possible outcomes. One of
these is the Poisson distribution, used when the outcome event involves
counts within a specified period of time. The equation for the Poisson distribution
is

Poisson Distribution

$$p(y) = \frac{\lambda^{y}}{y!} e^{-\lambda}, \quad y = 0, 1, 2, \ldots$$

where λ (pronounced “lambda”) is the average number of events in the
specified time period and y! stands for “y factorial,” which is equal to the
product y(y − 1)(y − 2) ⋯ (3)(2)(1). For example, 4! = 4 × 3 × 2 × 1 = 24.
Lambda is an example of a parameter, a term in the formula for a probability
distribution that defines its shape and values. Let’s see what probabilities
are generated for a specific value of λ.

Suppose we want to determine the number of car accidents at an intersection
in a given year and we know that the average number of accidents is 3.
What is the probability of exactly two accidents occurring that year? The
Poisson distribution usually applies to this situation. In this case, the value of
λ is 3, y = 2, and the probability is

$$p(2) = \frac{3^{2}}{2!} e^{-3} = \frac{9 \cdot 0.0498}{2 \cdot 1} = 0.224$$

or the probability of exactly two accidents occurring at the intersection is
about 22%. Note that the probabilities extend across an infinite number of
possible integer values.
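
The same arithmetic can be checked with a few lines of code. This is a minimal Python sketch of the Poisson formula given above; the function name `poisson_pmf` is an illustrative choice and is not an Excel or StatPlus function.

```python
import math


def poisson_pmf(y, lam):
    """Probability of exactly y events when the average number of
    events in the period is lam (the Poisson formula)."""
    return lam ** y * math.exp(-lam) / math.factorial(y)


# Car accident example: average of 3 accidents per year, exactly 2 occur.
p_two = poisson_pmf(2, 3)
print(round(p_two, 3))  # 0.224
```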

Discrete distributions can be displayed with a bar chart in which the
height of each bar is proportional to the probability of the event. Figure 5-2
displays the probability distribution from y = 0 to y = 10 accidents per year.

Figure 5-2 Poisson probability distribution for car accident data

To find the probability of a set of discrete events, we simply add up the
individual probabilities of each event in the set. So to find the probability of
two or fewer accidents occurring at the intersection we add up the probabilities
of no accidents (.050), 1 accident (.149), and 2 accidents (.224) to arrive
at an overall probability of .423, or about 42%. Because the total probability
is 1, the probability of more than two accidents occurring at the intersection
would thus be 58%.
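
Adding up the individual probabilities is easy to sketch in code. The Python fragment below is illustrative only; the helper names `poisson_pmf` and `poisson_cdf` are hypothetical, and the average of 3 accidents per year comes from the example above.

```python
import math


def poisson_pmf(y, lam):
    """Probability of exactly y events when the average is lam."""
    return lam ** y * math.exp(-lam) / math.factorial(y)


def poisson_cdf(y_max, lam):
    """Probability of y_max or fewer events: the sum of the
    individual probabilities for 0, 1, ..., y_max."""
    return sum(poisson_pmf(y, lam) for y in range(y_max + 1))


p_two_or_fewer = poisson_cdf(2, 3)
p_more_than_two = 1 - p_two_or_fewer  # total probability is 1
print(round(p_two_or_fewer, 3), round(p_more_than_two, 3))  # 0.423 0.577
```

The second value, .577, is the "about 58%" quoted in the text.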


Continuous Probability Distributions 

In continuous probability distributions, probabilities are assigned to a range
of continuous values rather than to distinct individual values. For example,
consider a person shooting at a target. The distribution of shots around the
bull’s eye follows a continuous distribution. If the shooter is good, the probability
that the shots will cluster closely around the bull’s eye is very high
and it is unlikely that a shot will miss the target entirely.

Continuous probability distributions are calculated using a probability
density function (PDF). When we plot a PDF against the range of possible
values, we get a curve in which the curve’s height indicates the position of
the most likely values. Figure 5-3 shows a sample PDF curve.

Figure 5-3 A sample probability density function

The probability associated with a range of values is equal to the area under
the PDF curve. Note that unlike discrete distributions, we can’t assign a
probability to a specific value in a continuous distribution. The probability
of any specific value is zero because its area under the curve is zero (it has
a positive height, but zero width). The total area under any PDF curve must
be equal to 1 because the total probability must be 1.



CONCEPT TUTORIALS 

PDFs 


To see the relationship between probability and the area under the PDF 
curve, open the instructional workbook named Probability. 


To open the Probability workbook: 

1 Open the file Probability, located in the Explore folder. 

The workbook opens to the Contents page, describing the nature of 
the workbook. 

2 You can move through the sheets in the workbook, reviewing the 
material on probability discussed so far in this chapter. 

3 Click Explore a PDF from the Table of Contents column. 

The sheet displays the curve shown in Figure 5-4.

Figure 5-4 The Probability Explore workbook

Notice the two horizontal scroll bars below the chart. You’ll use these to
set the range boundaries on the PDF curve. Use them now to select the range
from −1 to 1 on the curve.

To set the range boundaries on the PDF curve:

1 Click or drag the scroll button of the bottom scrollbar until the right
boundary equals 1.

2 Click or drag the scroll button arrow of the top scrollbar until the
left boundary equals −1.

Figure 5-5 shows the PDF with the range from −1 to 1 selected.
The area of this range is equal to 0.6827.

Figure 5-5 Selecting the range from −1 to 1

Because the area under the curve is equal to 0.6827, the probability
of a value falling between −1 and 1 is equal to 68.27%. Try experimenting
with other range boundaries until you get a good feel for
the relationship between probabilities and areas under the curve.

3 Click the left scroll arrow until the left boundary equals −3.

4 Click the right scroll arrow until the right boundary equals 3.

Note that when you move the lower boundary to −3 and the upper
boundary to 3, the probability is 99.73%, not 100%. That is because
this particular probability distribution extends from minus infinity
to plus infinity; the area under the curve (and hence the probability)
is never 1 for any finite range.

5 Close the Probability workbook. You do not have to save any changes.
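
The link between probability and area can also be checked numerically. The Python sketch below is not the workbook's own calculation. It assumes the curve in the tutorial is the standard normal density (an assumption, although the areas 0.6827 and 0.9973 quoted above are consistent with it) and approximates the area with the trapezoid rule. The names `standard_normal_pdf` and `area_under_curve` are illustrative.

```python
import math


def standard_normal_pdf(x):
    """Height of the assumed PDF curve (standard normal) at x."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)


def area_under_curve(pdf, lower, upper, steps=10_000):
    """Approximate the area under pdf between lower and upper
    using the trapezoid rule with the given number of steps."""
    width = (upper - lower) / steps
    total = 0.5 * (pdf(lower) + pdf(upper))
    for i in range(1, steps):
        total += pdf(lower + i * width)
    return total * width


print(round(area_under_curve(standard_normal_pdf, -1, 1), 4))  # 0.6827
print(round(area_under_curve(standard_normal_pdf, -3, 3), 4))  # 0.9973
```

Widening the range brings the area closer to 1 without ever reaching it, which is the point made in the note above.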


Random Variables and Random Samples 

Values from a discrete or continuous probability distribution function
are manifested in a random variable, where a random variable is a variable
whose values occur at random, following a probability distribution.
A discrete random variable comes from a discrete probability distribution,
and a continuous random variable comes from a continuous probability
distribution. Random variables are usually written with a capital letter,
whereas lowercase letters are used to denote a particular value that the random
variable may attain. For example, if the random variable Y follows a
Poisson distribution, the probability that Y is equal to y is written as

$$P(Y = y) = \frac{\lambda^{y}}{y!} e^{-\lambda}$$

When a random variable is assigned a value (such as when we flip a coin 
or record the number of traffic accidents in a year), that value is called an 
observation. A collection of several such observations is called a sample. If 
the observations are generated or selected in a random fashion with no bias, 
then the sample is known as a random sample. 

In most cases, we want our samples to be random samples to give a true
picture of the underlying probability distribution. For example, say we create
a study of the weight of United States adult males. We want to know
what type of values we would be likely to get if we picked a man at random
and weighed him (here weight is our random variable). However, if our
sample is biased by selecting only men in their twenties, it would not be a
true random sample of all United States adult males. Part of the challenge of
statistics is to remove all bias from sampling.

This is difficult to do, and subtle biases can creep into even the most
carefully designed studies. By observing the distribution of values in a random
sample, we can draw some conclusions about the underlying probability
distribution. As the sample size increases, the distribution of the values
should more closely approximate the probability distribution. To return to
our example of the shooter, by observing the spread of shots around the bull’s
eye, we can estimate the probability distribution related to the shooter’s
ability to hit the target.



CONCEPT TUTORIALS 

Random Samples 


You can use the instructional workbook Random Samples to explore the 
relationship between a probability distribution and a random sample. 

To use the Random Samples workbook: 

1 Open the file Random Samples from the Explore folder. Enable any 
macros in the workbook. 

2 Move through the sheets in this workbook, viewing the material on 
random variables and random samples. 

3 Click Explore a Random Sample from the Table of Contents column.

In this worksheet, you can click the Shoot button to generate a random
sample of shots at the target. You can select the underlying probability
distribution and the number of shots the shooter takes. Try this
now with the accuracy of the shooter set to moderate to create a sample of
50 random shots.

To generate a random sample of shots:

1 Click the Shoot button.

2 Click the Moderate button and click the spin arrow to reduce the
number of shots to 50.

See Figure 5-6.

Figure 5-6 Accuracy dialog box

3 Click the OK button twice. 

Excel generates 50 random shots as shown in Figure 5-7 (your random
sample will be different).

Figure 5-7 Randomly generated sample of shots

The xy coordinate system on the target shows the bull’s eye, located at the
origin (0, 0). The distribution of the shots around the target is described by a
bivariate density function because it involves two random variables (one for
the vertical location and one for the horizontal location of each shot). We’ll
concentrate on the horizontal distribution of the shots.

Although many of the shots are near the bull’s eye, about a third of them
are farther than 0.4 horizontal unit away, either to the left or to the right
of the target. Because these are random data, your values may be different.
Based on the accuracy level you selected, a probability distribution showing
the expected distribution of shots to the left or right of the target is also
generated in the second column of the table. In this example, the predicted
proportion of shots within 0.4 unit of the target is 68.3%, which is close to
the observed value of 70%. In other words, the distribution predicts that a
person of moderate ability is able to hit the bull’s eye within 0.4 horizontal
unit about 68% of the time. This person came pretty close.

You can also examine the distribution of these shots by looking at the histogram
of the shots. For the purposes of this worksheet, a shot to the left of
the target has a negative value and a shot to the right of the target has a positive
value. The solid curve is the probability density function of shots to the
left or right of the target. After 50 shots, the histogram does not follow the
probability density function particularly closely. As you increase the number
of shots taken, the distribution of the observed shots should approach
the predicted distribution.

To increase the number of shots taken:

1 Click the Shoot button again.

2 Click the Moderate button and click the spin arrow to increase the
number of shots to 500.

Figure 5-8 compares the distribution of the random sample after
50 shots and after 500 shots. Note that the larger sample size more
closely follows the underlying probability distribution.

Figure 5-8 Distribution after 50 and 500 observations
[Two histograms: sample size = 50 and sample size = 500]

Try generating some more random samples of various sample sizes.
When you are finished, close the Random Samples workbook.
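
The idea behind the Random Samples tutorial can be sketched in a few lines of Python. This is only an illustration, not the workbook's method. It assumes the horizontal position of each shot is normally distributed around 0 with a standard deviation of 0.4 unit, a value chosen only because it reproduces the 68.3% figure quoted above; the function name `proportion_within` and the seed are likewise illustrative.

```python
import random


def proportion_within(n_shots, limit=0.4, sigma=0.4, seed=1):
    """Simulate n_shots horizontal positions (assumed normal with
    mean 0 and standard deviation sigma) and return the proportion
    that land within `limit` units of the bull's eye."""
    rng = random.Random(seed)
    hits = sum(abs(rng.gauss(0.0, sigma)) <= limit for _ in range(n_shots))
    return hits / n_shots


for n in (50, 500, 50_000):
    print(f"{n:>6} shots: {proportion_within(n):.3f} within 0.4 unit (predicted 0.683)")
```

Small samples can land noticeably above or below the predicted proportion; larger samples should sit closer to it, which is the behavior Figure 5-8 illustrates.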


The Normal Distribution 

In the Exploring Random Samples workbook, you worked with a distribution
in the form of a bell-shaped curve, called the normal distribution. This
common probability distribution is probably the most important distribution
in statistics. There are many real-world examples of normally distributed
data, and normally distributed data are assumed in many statistical
tests (for reasons you’ll understand shortly). The probability density function
for the normal distribution is

Normal Probability Density Function

$$f(y) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(y-\mu)^{2}}{2\sigma^{2}}}, \quad \sigma > 0, \; -\infty < \mu < \infty, \; -\infty < y < \infty$$

The normal distribntion has two parameters, /r (prononnced “mu”) and a 
(prononnced “sigma”). The /r parameter indicates the center, or mean, of the 
distribntion. The cr parameter measures the standard deviation, or spread, of the 
distribntion. To see how these parameters affect the distribution’s location and 
shape, yon can work with the instrnctional workbook named Distributions. 
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CONCEPT TUTORIALS 

The Normal Distribution 


To explore the normal distribution: 

1 Open the file Distributions, located in the Explore folder. Enable the 
macros contained in the workhook. 

2 Click Normal from the Table of Contents colnmn. The Normal work¬ 
sheet opens as shown in Fignre 5-9. 


Figure 5-9 
The Normal 
Distribution 
worksheet 
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The Normal workbook opens with /r set to a valne of 0 and cr set to a valne 
of 1. A normal distribntion with these parameter valnes is referred to as 
standard normal. Now observe what happens as yon alter the valnes of jjb and a. 

To change the values of μ and σ:

1 Click the up spin button next to the mu box to change the value of μ to 2.

Note that the distribution shifts to the right as the center of the
distribution now lies over the value 2.

2 Click the down spin button next to the mu box to change the value
of μ back to 0.

3 Click the down spin button next to the sigma box to reduce the value
of σ to 0.3. The distribution tightens around the center.

4 Click the up spin button next to the sigma box to increase the value of
σ to 1.5. The distribution spreads out, indicating a wider range of
probable values.

Figure 5-10 shows the normal curve for a variety of μ and σ values.


Figure 5-10 The normal distribution for varying values of σ
(panels: μ = 0, σ = 1; μ = 0, σ = 0.5; μ = 0, σ = 1.5)




Examine other values of μ and σ to continue to explore how those
changes affect the distribution. Close the Distributions workbook
without saving any changes when you're finished.


In the normal distribution, about 68.3% of the values lie within 1σ, or
1 standard deviation, of the mean μ. About 95.4% of the values lie within
2 standard deviations of the mean, and more than 99% of the values lie
within 3 standard deviations of the mean. See Figure 5-11.

Figure 5-11 Probabilities under the normal curve
(horizontal axis: number of standard deviations away from the mean)




Because normally distributed data appear so often in statistical studies, 
these benchmarks are an important rule of thumb. For example, if you are 
trying to calculate a range that will incorporate most of the data, taking the 
mean ±2 standard deviations is a fast way of estimating that range. 
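These benchmark percentages can also be checked outside Excel. The following is a minimal sketch in Python, assuming the standard library's NormalDist class; the helper name prob_within is introduced here for illustration only and is not part of Excel or StatPlus.

```python
from statistics import NormalDist


def prob_within(k, mu=0.0, sigma=1.0):
    """Probability that a normal value falls within k standard deviations of the mean."""
    dist = NormalDist(mu, sigma)
    # Area under the curve between mu - k*sigma and mu + k*sigma.
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {prob_within(k):.3f}")
```

Because the calculation is expressed in units of standard deviations, the result does not depend on the particular values chosen for μ and σ.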


Excel Worksheet Functions 

Excel includes several functions to work with the normal distribution 
described in Table 5-1. 


Table 5-1 Functions with the Normal Distribution

Function                                 Description

NORMDIST(y, mean, std_dev, type)         Uses the normal distribution with μ = mean and
                                         σ = std_dev. Setting type = TRUE calculates
                                         the probability of Y ≤ y. Setting type = FALSE
                                         calculates the value of the probability density
                                         function at y.

NORMINV(p, mean, std_dev)                Returns the value y from the normal distribution for
                                         μ = mean and σ = std_dev, such that P(Y ≤ y) = p.

NORMSDIST(y)                             Returns the probability of Y ≤ y for the standard
                                         normal distribution.

NORMSINV(p)                              Returns the value y from the standard normal
                                         distribution such that P(Y ≤ y) = p.

NORMBETW(lower, upper, mean, std_dev)    Calculates the probability from the normal
                                         distribution with μ = mean and σ = std_dev for the
                                         range lower ≤ y ≤ upper. StatPlus required.

For example, if you want to calculate the probability of a random variable
from a normal distribution with μ = 50 and σ = 4 having a value ≤ 40, apply
the Excel formula

=NORMDIST(40, 50, 4, TRUE)

and Excel returns the value .00621, indicating that there is a 0.621%
probability of a value less than or equal to 40 from such a distribution. The value
of the PDF at that point returned by the formula

=NORMDIST(40, 50, 4, FALSE)

is .004382, which is the height of the probability density function at
that point on the PDF curve. On the other hand, if you want to calculate the
value that corresponds to a particular cumulative probability, you use the
NORMINV() function. The formula

=NORMINV(0.90, 50, 4)

returns the value 55.12621, indicating that in a normal distribution with μ
= 50 and σ = 4, there is a 90% probability that a random variable will have
a value of 55.12621 or less.
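If you want to cross-check these three results outside Excel, the sketch below reproduces them in Python. It assumes the standard library's NormalDist class as a stand-in for the Excel functions; the variable names are chosen here for illustration.

```python
from statistics import NormalDist

# Normal distribution with mu = 50 and sigma = 4, as in the example above.
dist = NormalDist(mu=50, sigma=4)

cumulative = dist.cdf(40)       # plays the role of =NORMDIST(40, 50, 4, TRUE)
density = dist.pdf(40)          # plays the role of =NORMDIST(40, 50, 4, FALSE)
quantile = dist.inv_cdf(0.90)   # plays the role of =NORMINV(0.90, 50, 4)

print(f"P(Y <= 40)      = {cumulative:.5f}")
print(f"PDF height at 40 = {density:.6f}")
print(f"90th percentile  = {quantile:.5f}")
```

The cumulative probability and the inverse are two directions of the same relationship: cdf maps a value to a probability, and inv_cdf maps a probability back to a value.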


Using Excel to Generate Random Normal Data 


Now that you’ve learned a little about the normal distribution, you can use
Excel to randomly generate observations from a normal distribution. You’ll
start by creating a single sample of 100 observations coming from a normal
distribution with μ = 100 and σ = 25. To do this, you need to have the
StatPlus add-in installed on Excel.

To create 100 random normal values:

1 Open a new blank workbook in Excel, click cell A1, and type Normal
Data in the cell.

2 Click Create Data from the StatPlus menu and then click Random
Numbers.

The Random Numbers command presents a dialog box from which
you can create random samples from a large variety of distributions.
In this case you’ll choose the normal distribution.

3 Click Normal from the Type of Distribution list box.

4 Type 1 in the Number of Samples to Generate box.

5 Type 100 in the Size of Each Sample box.

6 Type 100 in the Mean box.

7 Type 25 in the Standard Deviation box.

8 Click the Output button, click the Cell option button, and select
cell A2 as your output destination. Click the OK button to close the
Output Options dialog box.

Figure 5-12 shows the completed Create Random Numbers dialog box.


Figure 5-12 The Create Random Numbers dialog box



9 Click the OK button.

Excel generates a random sample of 100 observations following a normal
distribution with mean 100 and standard deviation 25. See Figure 5-13.

Figure 5-13 One hundred random normal observations



Because these are randomly generated values, your numbers will look 
different. 
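The same kind of sample can be sketched in Python. This is only an illustration that assumes the standard library's random module; it is not how StatPlus generates its values, and the seed value is an arbitrary choice made here so that the run can be repeated.

```python
import random
from statistics import mean, stdev

# A fixed seed (arbitrary) makes the run repeatable; remove it for fresh values.
rng = random.Random(2024)

# One sample of 100 observations from a normal distribution with
# mean 100 and standard deviation 25.
sample = [rng.gauss(100, 25) for _ in range(100)]

print(f"sample size:               {len(sample)}")
print(f"sample average:            {mean(sample):.2f}")
print(f"sample standard deviation: {stdev(sample):.2f}")
```

The sample average and standard deviation will be near 100 and 25 but not equal to them, which is the point the next sections develop.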


EXCEL TIPS 



You can also create random samples using the Analysis ToolPak
add-in available with Excel. To create a random sample, load the
Analysis ToolPak, click the Data Analysis button from the
Analysis group on the Data tab and select Random Number
Generation from the Data Analysis dialog box.

StatPlus adds several new functions to Excel to generate random 
numbers, including the RANDNORM command to create a random 
number from a normal distribution. 


Charting Random Normal Data 

Now that you’ve created a random sample of normal data, your next task is
to create a histogram of the distribution. The StatPlus Histogram command
also includes an option to overlay a normal curve on your histogram to
compare the distribution of your data with the normal distribution.

To create a histogram of the random sample:

1 Click Single Variable Charts from the StatPlus menu and then click
Histograms.

2 Click the Data Values button, click the Use Range References
option button, and select the range A1:A101. Click the OK button.

3 Click the Normal curve checkbox.

4 Click the Output button and type Normal Histogram in the As a
New Chart Sheet box to send the chart to a new chart sheet. Click
the OK button.

Figure 5-14 shows the histogram of the randomly generated values
from the normal distribution.


Figure 5-14 Histogram of the 100 random normal values



The histogram does not follow the normal curve exactly, but as you saw 
earlier, if you increase the size of the random sample, the distribution of the 
sample values approaches the underlying probability distribution. A sample 
size of 100 is perhaps still too small. Because you generated these values, 
you already know that the data are normally distributed, but suppose you 
observed these values in an experiment or study. Would the chart shown 
in Figure 5-14 convince you that you’re working with normal data? It’s not 
always easy to tell from a histogram whether your data are normal, so
statisticians have developed some procedures to check for normality.

The Normal Probability Plot


To check for normality, statisticians compute normal scores for their data.
A normal score is the value you would expect if your sample came from a
standard normal distribution. As an example, for a sample size of 5, here are
the five normal scores:

-1.163, -0.495, 0, 0.495, 1.163 

To interpret these numbers, think of generating sample after sample of 
standard normal data, each sample consisting of five observations. Now, 
take the average of the smallest value in each sample, the second smallest 
value, and so forth up to the average of the largest value in each sample. 
Those averages are the normal scores. Here, we would expect the largest
value from a random sample of five standard normal values to be 1.163 and
the smallest to be -1.163.
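The thought experiment just described can be carried out directly. The sketch below, in Python, generates many samples of five standard normal values, sorts each sample, and averages the values position by position. The function name simulated_normal_scores, the number of repetitions, and the seed are assumptions made for this illustration; StatPlus may compute its normal scores by a different method.

```python
import random


def simulated_normal_scores(n, repetitions=20000, seed=1):
    """Approximate normal scores by averaging sorted standard normal samples."""
    rng = random.Random(seed)
    totals = [0.0] * n
    for _ in range(repetitions):
        # Sort one sample so position 0 is the smallest value, position n-1 the largest.
        ordered = sorted(rng.gauss(0, 1) for _ in range(n))
        for position, value in enumerate(ordered):
            totals[position] += value
    # Average of the smallest values, of the second smallest values, and so on.
    return [total / repetitions for total in totals]


scores = simulated_normal_scores(5)
print([round(score, 2) for score in scores])
```

With enough repetitions the five averages settle close to the normal scores listed above; because this is a simulation, they will not match to every decimal place.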

Once you’ve generated the appropriate normal scores, plot the largest 
value in your data set against the largest normal score, the second largest 
value against the second largest normal score, and so forth. This is called 
a normal probability plot. If your data are normally distributed, the points 
should fall close to a straight line. 

StatPlus includes a command to calculate normal scores and create a 
normal probability plot. Use it now to plot your random sample of normal 
data. 

To create a normal probability plot: 

1 Click Single Variable Charts from the StatPlus menu and then click 

Normal P-plots. 

2 Click the Data Values button, click the Use Range References
option button, and select the range A1:A101 on your worksheet.
Click the OK button.

3 Click the Output button and type Normal P-plot in the As a New
Chart Sheet box to send the chart to a new chart sheet. Click the OK
button.

4 Click the OK button to start creating the normal probability plot.

Figure 5-15 shows the resulting plot (yours will look slightly different
because you’ve generated a different set of random values).

Figure 5-15 Normal probability plot

The points follow a general straight line trend fairly well, but with
some departures at the left end of the scale. If the sample size were
larger it might follow the line even more closely.

Close your workbook. You do not have to save any of the random
data or plots you created.


Let’s apply this technique to some real data. The Baseball workbook from
the Chapter05 data folder contains information about baseball player salaries
and batting averages. Do the batting averages follow a normal distribution?
Let’s find out. We’ll start by creating a histogram of the batting average data.

To create a histogram of the batting average data:

Open the Baseball workbook located on the companion website.
Save the workbook as Baseball Batting Averages.

Click Single Variable Charts from the StatPlus menu and then click
Histograms.

Click the Data Values button, select AVG from the list of range
names, and click OK.

5 Click the Normal Curve checkbox.

6 Click the Output button and send the histogram to a chart sheet 
named Batting Average. 

7 Click the OK button to start creating the histogram and normal 
curve. 

Figure 5-16 shows the resulting chart. 


Figure 5-16 Distribution of the batting average data



The distribution of the batting average values appears to follow the
superimposed normal curve pretty well (certainly no worse than the sample of
random numbers generated earlier). There is no indication that the batting
averages do not follow the normal distribution. Let’s further check this
assumption with a normal probability plot.

To create a normal probability plot of the batting average data:

1 Return to the Baseball Salaries worksheet.

2 Click Single Variable Charts from the StatPlus menu and then click
Normal P-plots.

3 Click the Data Values button and select AVG from the list of range
names. Click the OK button.

4 Click the Output button and send the probability plot to a chart 
sheet named Batting Average P-plot. 

5 Click the OK button to start creating the normal probability plot. See 
Figure 5-17. 


Figure 5-17 Normal probability plot of the batting average data

Save your changes and close the workbook.


The batting average data follow a general straight line trend on the
normal probability plot pretty well. The only serious departures from the line
occur at either end of the distribution of normal scores. You can examine
the normal scores to determine what the batting averages at the end of the
distribution would be if the sample data more closely followed the
normal distribution.

To convert a normal score back to the scale of the original data, you
multiply the normal score by the standard deviation of the observed values
and then add the sample average. In this case, the average batting average is
0.275528, and the standard deviation is 0.022529. If the largest normal score is
2.812, this translates into an expected batting average of 2.812 × 0.022529 +
0.275528 = 0.3389, a value slightly higher than the observed maximum
batting average of 0.333. On the other end of the scale the lowest normal
score is -2.812, which corresponds to a batting average of 0.2122, greater than
the observed value of 0.19. So although the batting averages appear to generally
follow the normal distribution, the values at either end of the sample are
less than would be expected from normal data.
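The conversion is a single multiply-and-add, shown here as a short Python sketch. The function name score_to_data_scale is hypothetical; the mean, standard deviation, and normal scores are the values quoted in the text.

```python
def score_to_data_scale(normal_score, sample_mean, sample_sd):
    """Convert a normal score back to the scale of the original data."""
    return normal_score * sample_sd + sample_mean


mean_avg = 0.275528   # average batting average, from the text
sd_avg = 0.022529     # standard deviation of the batting averages, from the text

expected_high = score_to_data_scale(2.812, mean_avg, sd_avg)
expected_low = score_to_data_scale(-2.812, mean_avg, sd_avg)

print(f"expected highest batting average: {expected_high:.4f}")
print(f"expected lowest batting average:  {expected_low:.4f}")
```

Comparing these expected values with the observed maximum (0.333) and minimum (0.19) shows where the data depart from the straight line on the plot.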

One of the advantages of the normal probability plot is that if your data
are skewed in either the positive or the negative direction, this will be
clearly displayed in the plot. Positively skewed data fall below the straight
line on both ends of the plot, whereas negatively skewed data rise above
the straight line at both ends of the plot. Figure 5-18 shows a histogram and
normal probability plot of the salaries of the baseball players in the
workbook. The data are clearly not normal as the distribution is heavily weighted
toward lower salaries. The salaries are below the line at both ends because
of positive skewness.


Figure 5-18 Distribution of the baseball player salary data



Parameters and Estimators 

When investigating the properties of a probability density function, the
parameter values of the function were known; however, most of the time we
don’t know the values of these parameters, so we have to use the data to
estimate them using statistics. For example, in the normal distribution we
have two parameters, μ and σ. We can estimate the value of μ by calculating
the sample average x̄ and the value of σ by calculating the sample standard
deviation s (see Chapter 4 for a description of these statistics).

The values x̄ and s have a special and important property: They are not
only estimators of μ and σ but are also consistent estimators, which means
that as the size of the random sample is increased, the values of x̄ and s come
closer and closer to the true parameter values. With a large enough sample
size, x̄ and s will estimate the true values of μ and σ to whatever degree of
precision is required. Figure 5-19 shows how the value of x̄ approaches the
value of μ as the sample size increases.


Figure 5-19 (chart annotation: the sample average approaches the value of μ
as the number of observations increases)


The key question is: How large must the sample be to estimate accurately
the value of μ? To answer this question, we have to examine the properties
of the sample average.


The Sampling Distribution 

Because the sample average is calculated by taking the average of random
variables, it is also a random variable following its own probability
distribution called the sampling distribution. If we know the form of the sampling
distribution, we can make inferences about the sample average.

We’ll start our exploration of the sample average by creating 100 random
samples of standard normal data, each sample containing 16 observations.
From those random values, we’ll create a new column of data containing the
average of each of the samples. The distribution of those sample averages
will approximate the underlying sampling distribution.

To create 100 samples with 16 observations each:

1 Open a new blank workbook in Excel. 

2 Click Create Data from the StatPlus menu and then click Random 
Numbers. 

3 Select Normal from the Type of Distribution list box. 

4 Enter 16 in the Number of Samples to Generate box. 

5 Enter 100 in the Size of Each Sample box. 

6 Enter 0 in the Mean box and 1 in the Standard Deviation box. 

7 Click the Output button and select cell A1 as the output cell on the
current worksheet. Click the OK button.

The completed dialog box is shown in Figure 5-20. 


Figure 5-20 The completed Create Random Numbers dialog box

8 Click the OK button to start generating the random samples.

Excel creates 16 columns each containing 100 rows of random values
from a standard normal distribution. Now take the average of each row; this
provides you with 100 rows of sample averages with each average drawn
from a sample of 16 random normal values.

To calculate the averages of the 100 random samples:

1 Click cell Q1, type the formula =average(A1:P1), and press Enter.

2 Click cell Q1 again and drag the fill handle down to cover the range
Q1:Q100. Column Q now contains the average of each of the 100
samples on the worksheet.


The column of averages you just created should be much less variable 
than each of the individual samples, because the average smoothes out the 
highs and the lows of the values found within each sample. What kind of 
distribution does it have? Let’s investigate by creating a histogram of the 
sample averages. 


To create a histogram of the sample averages:

1 Click Single Variable Charts from the StatPlus menu and then click
Histograms.

2 Click the Data Values button, click the Use Range References option
button, and select the range Q1:Q100. Deselect the Range Includes
Row of Column Labels checkbox. Click the OK button.

3 Click the Normal Curve checkbox.

4 Click the Output button and save the histogram to a chart sheet
named Sample Average Histogram.

5 Click the OK button to start creating the histogram.

Figure 5-21 shows the resulting histogram (yours will differ since it
comes from a different random sample).

Figure 5-21 Histogram of sample averages



The distribution of the sample averages looks normal and, in fact, is
normal. The distribution is centered at 0, as you would expect for averages
based on samples taken from the standard normal distribution. The
distribution differs from the standard normal in one respect: The sampling
distribution is much narrower around the mean. Most of the standard normal
values lie between -2 and 2, but here most of the values lie between -0.5
and 0.5. Apparently the σ for the sampling distribution of the sample
average is smaller than the value of σ for a standard normal. To verify this,
calculate descriptive statistics for the sample averages.

To calculate descriptive statistics for the sample averages:

1 Return to the worksheet containing the sample average values.

2 Click Descriptive Statistics from the StatPlus menu and then click
Univariate Statistics.

3 Click the Input button, click the Use Range References option
button, and select the range Q1:Q100. Deselect the Range Includes a
Row of Column Labels checkbox. Click the OK button.

4 Click the Summary tab and click the Count and Average checkboxes.

5 Click the Variability tab and click the Std. Deviation checkbox.

6 Click the Output button, and then click the New Worksheet option
button and type Sample Average Statistics for the new worksheet
name. Click the OK button.

7 Click the OK button to generate the sample statistics.

8 Select the range B4:B5 and reduce the number of decimal places to 3.
Figure 5-22 shows the sample output (yours will be slightly different).

Figure 5-22 Sample average statistics

Close the workbook with the random samples. You don’t have to
save your changes.


As you would expect, the average of the sample averages is near zero since
the averages come from standard normal values in which μ is equal to zero.
The standard deviation of the sample averages is 0.256. Thus the sampling
distribution of the average values taken from samples with sample sizes of 16
appears to follow a normal distribution with a μ of 0 and a σ of about 0.25.
Is there a relationship between sample size and the value for the standard
deviation?

CONCEPT TUTORIALS

Sampling Distributions

You can use the exploration workbook Population Parameters to explore 
how sample size affects the distribution of the sample average. 

To explore sampling distributions: 

1 Open the file Population Parameters, from the Explore folder,
enabling the macros the workbook contains.

2 The workbook contains information on parameters and sampling 
distributions. Review the material. 

3 Click Exploring Sampling Distributions from the Table of Contents. 

4 The worksheet displays a histogram of 150 sample averages. A scroll 
bar allows you to change the size of each sample from 1 up to 16. 

5 Move the scroll bar and observe how the shape of the distribution 
changes on the basis of the different sample sizes. Also note how the 
value of the standard deviation changes. 

Figure 5-23 shows the sampling distribution for different sample 
sizes. Do you see a pattern in the values of the standard deviation 
compared to the sample size? 


Figure 5-23 Variability Statistics

Continue working with the Population Parameters workbook
and close it when you’re finished. You do not have to save your
changes.


These values illustrate an important statistical fact. If a sample is
composed of n random variables coming from a normal distribution with
mean μ and standard deviation σ, then the distribution of the sample
average will also be a normal distribution with mean μ but with standard
deviation $\sigma/\sqrt{n}$. For example, the distribution of a sample average of 16
standard normal values is normal with a mean of 0 and a standard
deviation of $1/\sqrt{16} = 1/4$, or .25.
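A simulation along the lines of the worksheet exercise shows the same relationship. The sketch below, in Python, builds 100 samples of 16 standard normal values, averages each sample, and compares the standard deviation of those averages with σ/√n. The seed and the variable names are assumptions made for this illustration.

```python
import math
import random
from statistics import mean, stdev

rng = random.Random(16)   # arbitrary seed so the run can be repeated
n = 16                    # observations per sample
number_of_samples = 100   # mirrors the 100 rows in the worksheet exercise

# One average per sample of n standard normal values.
averages = [
    mean([rng.gauss(0, 1) for _ in range(n)])
    for _ in range(number_of_samples)
]

theoretical_sd = 1 / math.sqrt(n)   # sigma / sqrt(n) with sigma = 1

print(f"average of the sample averages:  {mean(averages):.3f}")
print(f"std dev of the sample averages:  {stdev(averages):.3f}")
print(f"sigma / sqrt(n):                 {theoretical_sd:.3f}")
```

With only 100 averages the simulated standard deviation will not equal 0.25 exactly, just as the worksheet result of 0.256 did not.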


The Standard Error 

The standard deviation of x̄ is also referred to as the standard error of x̄.
The value of the standard error gives us the information we need to
determine the precision of x̄ in estimating the value of μ. For example, suppose
you have a sample of 100 observations that comes from a standard normal
distribution, so that the value of μ is 0 and of σ is 1. You’ve just learned that
x̄ is distributed normally with a mean of 0 and a standard deviation of 0.1
(because $0.1 = 1/\sqrt{100}$).

Let’s apply this to what you already know about the normal distribution,
namely that about 95% of the values fall within 2 standard deviations of the
mean. This means that we can be 95% confident that the value of x̄ will be
within 0.2 units of the mean. For example, if x̄ = 5.3 (still with σ = 1 and a
sample size of 100), we can be 95% confident that the value of μ lies
somewhere between 5.1 and 5.5. To be even more precise, we can increase the
sample size. If we want x̄ to fall within 0.02 of the value of μ 95% of the
time, we need a sample of size 10,000, because the standard error is then
$1/\sqrt{10{,}000} = 0.01$ and 2 standard errors equal 0.02. If x̄ is 5.3 with a
sample size of 10,000, we can be 95% confident that μ is between 5.28 and
5.32. Note that we can never discover the exact value of μ, but we can with
some high degree of confidence narrow the band of possible values to
whatever degree of precision we wish.
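The arithmetic behind these sample sizes can be sketched as follows. The Python function required_sample_size is hypothetical and uses the same rule of thumb as the text, namely that about 95% of sample averages fall within 2 standard errors of μ; it solves 2σ/√n = margin for n.

```python
import math


def required_sample_size(sigma, margin, multiplier=2):
    """Smallest n for which multiplier * sigma / sqrt(n) is no larger than margin."""
    n = (multiplier * sigma / margin) ** 2
    # Subtract a tiny amount before rounding up to guard against
    # floating-point noise such as 10000.000000000002.
    return math.ceil(n - 1e-9)


def standard_error(sigma, n):
    """Standard deviation of the sample average for a sample of size n."""
    return sigma / math.sqrt(n)


print(required_sample_size(sigma=1, margin=0.2))    # margin used for n = 100 in the text
print(required_sample_size(sigma=1, margin=0.02))   # margin used for n = 10,000 in the text
print(standard_error(sigma=1, n=10_000))
```

Because n grows with the square of 1/margin, making the estimate ten times more precise takes a sample one hundred times larger.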


The Central Limit Theorem 

The preceding discussion applied only to the normal distribution. What
happens if our data come from some other probability distribution?
Can we say anything about the sampling distribution of the average in
that case? We can, by means of the Central Limit Theorem. The Central
Limit Theorem states that if you have a sample taken from a probability
distribution with mean μ and standard deviation σ, the sampling
distribution of x̄ is approximately normal with a mean of μ and a standard
deviation of $\sigma/\sqrt{n}$. The remarkable thing about the Central Limit Theorem is that
the sampling distribution of x̄ is approximately normal, no matter what the
probability distribution of the individual values is. As the sample size
increases, the approximation to the normal distribution becomes closer and
closer. Now you see why the normal distribution is so important in the
field of statistics.
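To see the theorem at work on data that are far from normal, consider the following Python sketch. It draws samples from an exponential distribution with mean 1 and standard deviation 1, a strongly skewed distribution chosen here purely for illustration, and examines the averages of samples of size 30. The seed, sample size, and number of samples are likewise assumptions.

```python
import math
import random
from statistics import mean, stdev

rng = random.Random(30)   # arbitrary seed so the run can be repeated
n = 30                    # observations per sample
number_of_samples = 2000

# expovariate(1.0) draws from an exponential distribution whose
# mean and standard deviation are both 1.
averages = [
    mean([rng.expovariate(1.0) for _ in range(n)])
    for _ in range(number_of_samples)
]

predicted_sd = 1 / math.sqrt(n)   # sigma / sqrt(n) from the Central Limit Theorem

print(f"average of the sample averages: {mean(averages):.3f}  (mu = 1)")
print(f"std dev of the sample averages: {stdev(averages):.3f}  (predicted {predicted_sd:.3f})")
```

A histogram of these averages would look much closer to a bell shape than a histogram of the individual exponential values, which pile up near zero and trail off to the right.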

To see the effect of the Central Limit Theorem, you can use the
instructional workbook named The Central Limit Theorem.



CONCEPT TUTORIALS 

The Central Limit Theorem 


To use the Central Limit Theorem workbook: 

1 Open the Central Limit Theorem file from the Explore folder. 
Enable the macros in the workbook. 

2 Review, in the workbook, the concepts behind the Central Limit 
Theorem. 

3 Click Explore the Central Limit Theorem from the Table of 
Contents. 

The Central Limit Theorem worksheet opens. See Figure 5-24.

Figure 5-24 The Central Limit Theorem workbook

The worksheet lets you generate 150 random samples from one of eight
common probability distributions with up to 16 observations per sample.
The worksheet also calculates and displays the distribution of the sample
averages. You can change the sample size by dragging the scroll bar up or
down. The worksheet opens, displaying the distribution of the sample
average for the standard normal distribution. To see how the worksheet works,
move the scroll bar, increasing the sample size from 1 to 16.

To change the sample size:

1 Drag the scroll bar down. The sample size increases from 1 to 16.

As you drag the scroll bar down, the histogram displays the change
in the sample size, and the standard deviation decreases from 1 to
about 0.25.

2 Drag the scroll bar back up to return the sample size to 1.

Note that if you want to view the histogram with different bin values under
different sample sizes, you can deselect the Constant Bin Values checkbox.
When the bin values are the same, you can compare the spread of the data
from one histogram to another because the same bin values are used for all
charts. Deselecting the Constant Bin Values checkbox fits the bin values to the
data and gives you more detail, but it’s more difficult to compare histograms
from different sample sizes. You can also “freeze” the y axis to retain the y
axis scale from one sample size to another, making it easier to compare one
chart with another. To scale the y axis to the data, deselect the Freeze the
Y-Axis checkbox.

Now that you’ve viewed the sampling distribution for the standard
normal, you’ll choose a different distribution from the list. Another commonly
used probability distribution is the uniform distribution. In the uniform
distribution, probability values are uniform across a continuous range of
values. The probability density function for the uniform distribution is

Uniform Probability Density Function

$$f(y) = \frac{1}{b - a}, \qquad a \le y \le b,\ -\infty < a < b < \infty$$

where b is the upper boundary and a is the lower boundary of the
distribution. Figure 5-25 displays the Uniform distribution where a = -2 and b = 2.



The mean μ and standard deviation σ for the uniform distribution are

$$\mu = \frac{b + a}{2}, \qquad \sigma = \frac{b - a}{\sqrt{12}}$$

Thus if a = -2 and b = 2, μ = 0 and $\sigma = 4/\sqrt{12} = 1.1547$.

Having learned something about the uniform distribution, let’s observe the
sampling distribution of the sample average.
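Before working with the worksheet, it can help to know what values to expect. The Python sketch below evaluates the mean and standard deviation formulas for the uniform distribution and then applies σ/√n for a sample size of 16; the function name uniform_mean_sd is introduced here for illustration.

```python
import math


def uniform_mean_sd(a, b):
    """Mean and standard deviation of a uniform distribution on [a, b]."""
    mu = (b + a) / 2
    sigma = (b - a) / math.sqrt(12)
    return mu, sigma


mu, sigma = uniform_mean_sd(-2, 2)
standard_error_16 = sigma / math.sqrt(16)   # sigma / sqrt(n) with n = 16

print(f"mu = {mu}, sigma = {sigma:.4f}")
print(f"standard deviation of the average of 16 values: {standard_error_16:.4f}")
```

These are the values the worksheet's standard deviation should be near at sample sizes of 1 and 16, allowing for random variation in the 150 generated samples.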


To generate the sampling distribution of the uniform distribution:

1 Click the Uniform option button on the Central Limit Theorem
worksheet.

2 Enter -2 in the Minimum box and 2 in the Maximum box. See
Figure 5-26.


Figure 5-26 Setting the parameters for the uniform distribution



3 Click the OK button. 

Excel generates 150 random samples for the uniform distribution.

The initial sample size is 1, which is equivalent to generating 150
different observations from the uniform distribution. The initial
average value should be close to 0 and the initial standard deviation
value should be close to 1.15.

4 Drag the scrollbar down to increase the sample size from 1 to 16.

Figure 5-27 shows the sampling distribution for the average
under different sample sizes (your charts and values will be slightly
different).

You may want to unfreeze and freeze the y axis in order to display
the histograms in a more detailed scale.

Figure 5-27 Sampling distribution for average from the uniform distribution

Try some of the other distributions in the list under various sample
sizes. Close the workbook when you’re finished. You do not have to
save your changes.


A few final points about the Central Limit Theorem should be considered.
First, the theorem applies only to probability distributions that have a finite
mean and standard deviation. Second, the sample size and the properties of
the original distribution govern the degree to which the sampling distribution
approximates the normal distribution. For large sample sizes, the
approximation can be very good, whereas for smaller samples, the approximation
might not be good at all. If the probability distribution is extremely skewed,
a larger sample size will be necessary. If the distribution is symmetric, the
sample size usually need not be very large. How large is large? If the original
distribution is symmetric and already close in shape to a normal
distribution, a sample size of 15 or 20 should be large enough. For a highly skewed
distribution, a sample size of 40 or 50 might be required. Usually the Central
Limit Theorem can be safely applied if the sample size is 30 or more.

The Central Limit Theorem is probably the most important theorem in
statistics. With this theorem, statisticians can make reasonable inferences
about the sample mean without having to know about the underlying
probability distribution. You’ll learn how to make some of these inferences in
the next chapter.

Exercises


1. Explore the following statistical concepts: 

a. Define the term random variable. 

b. How is a random variable different 
from an observation? 

c. What is the distinction between x̄
and μ?

2. A sample of the top 50 women-owned
businesses in Wisconsin is undertaken.

Does this constitute a random sample?
Explain your reasoning. Can you make
any inferences about women-owned
businesses on the basis of this sample?

3. The administration counts the number
of low-birth-weight babies born each
week in a particular hospital. Assume,
for the sake of simplicity, that the rate of
low-birth-weight births is constant from
week to week.

a. Of the distributions that we have
studied, which one is applicable
here?

b. If the average number of
low-birth-weight babies is 5, what is the
probability that no low-birth-weight
babies will be born in a single week?

c. The administration counts the
low-birth-weight babies every week and
then calculates the average count for
the entire year. What is the
approximate distribution of the average?

4. The results of flipping a coin follow a
probability distribution called the
Bernoulli distribution. A Bernoulli
distribution has two possible outcomes,
which we’ll designate with the numeric
values 0 and 1. The probability function
for the Bernoulli distribution is


Bernoulli Distribution

$$P(Y = 1) = p, \qquad P(Y = 0) = 1 - p, \qquad 0 < p < 1$$

where p is between 0 and 1. For
example, if we tossed an unbiased coin and
indicated the value of a head with 1 and
a tail with 0, the value of p would be .5
since it is equally likely to have either a
head or tail.

The mean value of the Bernoulli
distribution is p. The standard deviation
is $\sqrt{p(1 - p)}$. In the flipping coin
example, the mean value is equal
to 0.5 and the standard deviation is
$\sqrt{0.5(1 - 0.5)} = 0.5$.

a. You toss a die, recording a 1 for the 
values 1 through 3, and a 0 for values 
4 through 6. What is the mean value? 
What is the standard deviation? 

b. You toss a die, recording a 1 for a 
value of 1 or 2, and a 0 for the values 
3, 4, 5, and 6. What is the mean value? 
What is the standard deviation? 

c. You toss a die, recording a 1 for a value of 1, and a 0 for all
other values. What is the mean value? What is the standard deviation?
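
The mean and standard deviation formulas from Exercise 4 can also be checked outside Excel. The following is a minimal Python sketch, not part of the StatPlus materials; the function name bernoulli_mean_sd is an illustrative assumption. It reproduces only the fair-coin example given in the text.

```python
import math

def bernoulli_mean_sd(p):
    """Return (mean, standard deviation) of a Bernoulli variable.

    Hypothetical helper for illustration: mean = p, sd = sqrt(p(1 - p)).
    """
    if not 0 < p < 1:
        raise ValueError("p must lie strictly between 0 and 1")
    return p, math.sqrt(p * (1 - p))

# Fair coin from the text: head = 1, tail = 0, so p = 0.5
coin_mean, coin_sd = bernoulli_mean_sd(0.5)
print(coin_mean, coin_sd)  # 0.5 0.5
```

Calling the same function with a different value of p is all that is needed to check your answers to parts a through c.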


5. If you flip 10 coins, what is the probability of getting exactly
5 heads? To answer this question, you have to refer to the Binomial
distribution, which is the distribution of repeated trials of a
Bernoulli random variable. The probability function for the Binomial
distribution is

Binomial Distribution

\[ P(Y = y) = \binom{n}{y}\, p^{y} (1 - p)^{n - y}, \qquad y = 0, 1, 2, \ldots, n \]

where

\[ \binom{n}{y} = \frac{n!}{y!\,(n - y)!} \]

where p is the probability of the event (such as getting a head) and
n is the number of trials. For example, to calculate the probability
of getting exactly 5 heads in 10 tosses, the formula is

\[ P(Y = 5) = \binom{10}{5}\, 0.5^{5}\, (1 - 0.5)^{10 - 5} = 0.2461 \]

or 24.6%. In other words, there is about a 1 in 4 chance of getting
exactly 5 heads out of 10 tosses. To calculate the probability of
getting 5 or fewer heads, we add the probabilities for the individual
numbers: P(Y = 0), P(Y = 1), P(Y = 2), P(Y = 3), P(Y = 4), and
P(Y = 5).

The mean of the Binomial distribution is np. The standard deviation
is \(\sqrt{np(1 - p)}\). For example, if we flip 100 coins with
p = .5, the expected number of heads is 100 × .5 = 50 and the
standard deviation is \(\sqrt{0.5 \cdot 0.5 \cdot 100} = 5\).

a. If you toss 20 coins, what is the probability of getting exactly
10 heads?

b. If you toss 50 coins, what is the expected number of heads? What
is the standard deviation?

c. You toss 10 dice, recording a 1 for a 1 or a 2, and a 0 for a 3,
4, 5, or 6. What is the expected total? What is the standard
deviation?

d. You toss 10 dice, recording a 1 for a 1 and a 0 for the other
numbers. What is the expected total? What is the standard deviation?
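
The Binomial probability function in Exercise 5 translates almost directly into code. Here is a short Python sketch under the assumption that you have Python 3.8 or later (for math.comb); the function names are illustrative, not part of Excel or StatPlus. It reproduces only the two worked examples from the text.

```python
import math

def binomial_pmf(y, n, p):
    """P(Y = y) for n independent trials with success probability p."""
    return math.comb(n, y) * p**y * (1 - p) ** (n - y)

def binomial_mean_sd(n, p):
    """Mean np and standard deviation sqrt(np(1 - p))."""
    return n * p, math.sqrt(n * p * (1 - p))

# Worked example from the text: exactly 5 heads in 10 tosses
p_five = binomial_pmf(5, 10, 0.5)
print(round(p_five, 4))  # 0.2461

# Worked example from the text: 100 coins with p = .5
mean_100, sd_100 = binomial_mean_sd(100, 0.5)
print(mean_100, sd_100)  # 50.0 5.0
```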

6. Excel includes the BINOMDIST function to calculate values of the
binomial distribution. The syntax of the function is

BINOMDIST(number, trials, prob, cumulative)

where number is the number of successes, trials is the number of
trials, prob is the probability of success, and cumulative is TRUE to
calculate the cumulative value of the probability function and FALSE
to calculate the value for the specified number value. For example,
the formula

BINOMDIST(5, 10, 0.5, FALSE)

returns the value 0.246. To calculate the cumulative value (in this
case the probability of getting 5 or fewer heads out of 10), use the
formula

BINOMDIST(5, 10, 0.5, TRUE)

which returns the value .623. Thus, there is a 62.3% probability of
getting 5 or fewer heads out of 10. Use Excel to answer the following
questions:

a. What is the probability of getting exactly 10 heads out of 20 coin
tosses? What is the probability of getting 10 or fewer?

b. What is the probability of getting 10 heads out of 15 coin tosses?
What is the probability of getting more than 10?

c. You toss 10 dice, recording a 1 for a 1 or 2, and a 0 for a 3, 4,
5, or 6. If you total up your scores, what is the probability of
scoring exactly 3 points? What is the probability of scoring 3 or
fewer?

d. You toss 100 dice, recording a 1 for a 1 and a 0 for the other
numbers. What is the probability of recording a score of exactly 20?
What is the probability of scoring 20 or less?

e. If you toss a coin 10 times, what is the probability of recording
4, 5, or 6 heads?
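
To see what the cumulative argument in Exercise 6 does, it can help to write the calculation out. The sketch below is a hypothetical Python stand-in that merely mirrors the argument order of BINOMDIST; it is not Excel's implementation, and the lowercase name binomdist is an assumption made for illustration.

```python
import math

def binomdist(number, trials, prob, cumulative):
    """Illustrative stand-in with the same argument order as BINOMDIST."""

    def pmf(k):
        # Probability of exactly k successes
        return math.comb(trials, k) * prob**k * (1 - prob) ** (trials - k)

    if cumulative:
        # Sum the probabilities of 0, 1, ..., number successes
        return sum(pmf(k) for k in range(number + 1))
    return pmf(number)

print(round(binomdist(5, 10, 0.5, False), 3))  # 0.246
print(round(binomdist(5, 10, 0.5, True), 3))   # 0.623
```

Both printed values match the worked examples in the exercise, which suggests that the cumulative value is simply a running sum of the individual probabilities.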

7. The mean of the Poisson distribution is λ and the standard
deviation is \(\sqrt{\lambda}\), where λ is the expected count per
interval.

a. The number of accidents at a factory in a year follows a Poisson
distribution with an expected value of 10 accidents per year. What is
the value of λ? What is the standard deviation?

b. If you collect 25 years of accident information at this factory,
how could the number of accidents per year be used to estimate λ?
What would be the standard error of this estimate?

8. Excel includes a function to calculate values of the Poisson
distribution. The syntax of the function is

POISSON(x, mean, cumulative)

where x is the number of counts, mean is the expected number of
counts, and cumulative is TRUE to calculate the cumulative value of
the probability function and FALSE to calculate the density for the
specified x value. Use this function to calculate the cumulative and
specific probabilities of the following values:

a. λ = 2, x = 2

b. λ = 2, x = 3

c. λ = 2, x = 4

d. λ = 2, x = 5
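
The same idea applies to the POISSON function in Exercise 8. The following Python sketch assumes the standard Poisson probability function, exp(−λ)·λ^x / x!, which is not restated in this exercise; the lowercase name poisson is a hypothetical stand-in that only mirrors Excel's argument order. The example uses λ = 3, a value that does not appear in the exercises.

```python
import math

def poisson(x, mean, cumulative):
    """Illustrative stand-in with the same argument order as POISSON."""

    def pmf(k):
        # Probability of exactly k counts when the expected count is `mean`
        return math.exp(-mean) * mean**k / math.factorial(k)

    if cumulative:
        # Probability of x or fewer counts
        return sum(pmf(k) for k in range(x + 1))
    return pmf(x)

# Example with lambda = 3: exactly one count, then one or fewer counts
print(round(poisson(1, 3, False), 4))
print(round(poisson(1, 3, True), 4))
```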

9. Excel includes a function to calculate the probabilities associated
with the standard normal distribution. The syntax of the function is

NORMSDIST(z)

The function returns the probability of a standard normal random
variable having a value < z. Use this function to calculate the
following probabilities:

a. z = .5 

b. z = 1 

c. z = 1.65 

d. z = 1.96 

e. What is the probability of a standard normal random variable having
a value of exactly 2.0?

10. The Excel function

NORMSINV(prob)

returns the inverse of the standard normal distribution. For a given
cumulative probability prob, the function returns the value of z. Use
this function to calculate z values for the following probabilities
of the standard normal distribution:

a. .05 

b. .10 

c. .50 

d. .90 

e. .95 

f. .975 

g. .99 

11. Excel includes a function to calculate the probability for a
random variable coming from any normal distribution. The syntax of
the function is

NORMDIST(x, mean, std_dev, cumulative)

where x is the value of the random variable, mean is the mean of the
normal distribution, std_dev is the standard deviation of the
distribution, and cumulative is TRUE to calculate the cumulative value
of the probability function and FALSE to calculate the pdf for the
specified x value. Use this function to calculate the cumulative
probabilities for the following values:

a. x = 1.96, mean = 0, std_dev = 1

b. x = 1.96, mean = 0, std_dev = 0.5

c. x = 1.96, mean = 0, std_dev = 0.25

d. x = −1.96, mean = 0, std_dev = 1

e. x = 5, mean = 5, std_dev = 2

12. The Excel function

NORMINV(prob, mean, std_dev)

calculates the inverse of the normal distribution. For a cumulative
probability of prob, a mean value of mean, and a standard deviation
of std_dev, the function returns the value of x. Use this function to
calculate the x values for the following:

a. mean = 5, std_dev = 2, prob = .10

b. mean = 5, std_dev = 2, prob = .20

c. mean = 5, std_dev = 2, prob = .50

d. mean = 5, std_dev = 2, prob = .90

e. mean = 5, std_dev = 2, prob = .95

f. mean = 5, std_dev = 2, prob = .99

13. Open the Baseball workbook from the Chapter05 folder. You want to
analyze the batting average statistics from the workbook. The mean
career batting average is .263 and the standard deviation is .02338.

a. Open the workbook and save it as Batting Average Analysis.

b. Assuming that the batting averages are normally distributed, use
Excel’s NORMDIST function to find the probability that a player will
bat .300 or better. (Hint: Calculate 1 − probability that a player
will bat less than .300.)

c. How many players batted .300 or 
better? Compare this to the expected 
number. 

d. Save your workbook and summarize 
your findings. 


14. The Housing workbook contains a 
sample of 117 housing prices for 
Albuquerque, New Mexico, during the 
early 1990s. You’ve been asked to 
analyze this historic data set. 

a. Open the Housing workbook from 
the Chapter05 folder and save it as 
Housing Price Analysis. 

b. Create a histogram (with a normal 
curve) and a normal probability 
plot of the housing prices. Do the 
data appear to follow a normal 
distribution? 


c. Calculate the average housing price 
and the standard deviation of the 
housing price. Because this is a 
sample of all of the house prices in 
Albuquerque, the average serves as an 
estimate of the mean house price. 

d. Create a new column containing the log of the home price values.
Create a histogram with a normal curve and a normal probability plot.
Modify the x-axis label Number format to display the log values to
three decimal places. Do the transformed values appear more normally
distributed than the untransformed values?

e. Save your changes to the workbook. 
Write a report summarizing your 
observations and calculations. 


15. The dispersion of shots used in shooting at the target in the
Random Samples workbook follows a bivariate normal distribution (a
combination of two normal distributions, one in the vertical
direction and one in the horizontal direction). The value of σ for
each level of accuracy is

Accuracy    Standard Deviation
Highest     0.1
Good        0.2
Moderate    0.4
Poor        0.6
Lowest      1.0

a. Open the Random Samples workbook from the Explore folder and
create a distribution of shots around the target with good accuracy.

b. Explain why the predicted percentages have the values they have.

c. For a shooter with the lowest accuracy, how many shots would the
person have to take before she or he could assume with 95% confidence
that the average horizontal location of her or his shots was within
0.2 unit of the bull’s eye?

d. How many shots would a shooter with the highest accuracy have to
take before achieving similar confidence in the average placement of
his or her shots?

16. Study the properties of the Poisson distribution by doing the
following:

a. Open a blank workbook and, using the Create Random Data command
from StatPlus, create 16 columns of 100 rows of Poisson random values
with λ = 0.25.

b. Create another column in your workbook containing the average
values from those columns.

c. Create a histogram of the first column of random Poisson values
with a superimposed normal curve. Create a second histogram of the
column averages with a superimposed normal curve. Compare the two
curves.

d. Calculate the average and standard deviation of the two columns.

e. How do these calculated values compare to the value of λ? See
Exercise 7 for more information on the Poisson distribution.

f. Save your workbook as Poisson Distribution Analysis to the
Chapter05 folder.

17. Repeat questions a. through d. of 
Exercise 16 with 16 columns of 100 
rows of binomial random values where 
the number of trials is 16 and the value 
of p is .25. How do the averages in part d 
compare with the value p = .25? What 
about the standard deviations computed 
in part d? Save your workbook as 
Binomial Distribution Analysis to the 
Chapter05 folder. 


18. Repeat questions a. through d. of Exercise 16 with 16 columns of
100 rows of Bernoulli random values where p = .25. How do the
averages in part d compare with the value p = .25? What about the
standard deviations computed in part d? Save your workbook as
Bernoulli Distribution Analysis to the Chapter05 folder.

19. Repeat questions a. through d. of Exercise 16 with 16 columns of
100 rows of uniform random values where the lower boundary is 0 and
the upper boundary is 100. Save the workbook as Uniform Distribution
Analysis to the Chapter05 folder.

20. True or false: According to the Central Limit Theorem, as the
size of the sample increases, the distribution of the observations
approximates a normal distribution. Defend your answer.

21. You want to collect a sample of values from a uniform
distribution where μ is unknown but σ = 10.

a. How large a sample would you need to estimate the value of μ
within 2 units with a confidence level of 95%?

b. How large a sample would you need to estimate the value of μ
within 2 units with a confidence level of 99%?

c. If the sample size is 25 and μ is 50, what is the probability that
the sample average will have a value of 48 or less?

22. At the 1996 Summer Olympic games in Atlanta, Linford Christie of
Great Britain was disqualified from the finals of the men’s 100-meter
race because of a false start. Christie did not react before the
starting gun sounded, but he did react in less than 0.1 second.
According to the rules, anyone who reacts in less than a tenth of a
second must have false-started by anticipating the race’s start.
Christie bitterly protested the ruling, claiming that he had just
reacted very quickly. Using the reaction times from the first heat of
the men’s 100-meter race, try to weigh the merits of Christie’s claim
versus the argument of the race officials that no one can react as
fast as Christie did without anticipating the starting gun.

a. Open the Reaction workbook from the Chapter05 folder and save it
as Reaction Time Analysis.

b. Create a histogram of the reaction times. Where would a value of
0.1 second fall on the chart?

c. Calculate the mean and standard deviation of the first heat’s
reaction times. Use these values in Excel’s NORMDIST function and
calculate the probability that an individual would record a reaction
time of 0.1 or less.

d. Create a normal probability plot of the reaction times. Do the
data appear to follow the normal distribution?

e. Save your workbook and write a report summarizing your
conclusions. Include in your summary a discussion of the difficulties
in determining whether Christie anticipated the starter’s gun. Are
the data appropriate for this type of analysis? What are some
limitations of the data? What kind of data would give you better
information regarding a runner’s reaction times to the starter’s gun
(specifically, runners taking part in the finals of an Olympic
event)?
Chapter 6


Statistical Inference


Objectives

In this chapter you will learn to:
• Create confidence intervals
• Apply a hypothesis test
• Use the t distribution in a hypothesis test
• Perform a one-sample and a two-sample t test
• Analyze data using nonparametric approaches

The concepts you learned in Chapter 5 provide the basis for the subject
of this chapter, statistical inference. Two of the main tools of statistical
inference are confidence intervals and hypothesis tests. In this chapter,
you’ll apply these tools to reach conclusions about your data. You’ll be
introduced to a new distribution, the t distribution, and you’ll see how to use
it in performing statistical inference. You’ll also learn about nonparametric
tests that make fewer assumptions about the distribution of your data.


Confidence Intervals 


In the previous chapter, you learned two very important facts about
distributions and samples.

1. A sample average will approximately follow a normal distribution with
mean μ and standard deviation \(\sigma/\sqrt{n}\), where μ is the mean of the
probability distribution the sample is drawn from, σ is the standard deviation
of the probability distribution, and n is the size of the sample. Another
way of writing this is

\[ \bar{x} \sim N\!\left(\mu,\ \sigma^{2}/n\right) \quad \text{(approximately; the second argument is the variance)} \]



2. In a normal distribution, about 95% of the time, the values fall within 2
standard deviations of the mean.

From these two facts, we can calculate how precisely the sample average
estimates the value of μ. For example, if σ = 10 and our sample size is 25, the
sample average will approximately follow a normal distribution with mean
μ and standard deviation 2, so 95% of the time, the sample average will fall
within 4 units of μ. This indicates that if the sample average is 20, we could
construct a confidence interval from about 16 to 24 that should, with 95%
confidence, “capture” the value of μ. If we want this confidence interval to
be smaller, we simply increase the sample size. A sample of 100 observations
would result in a 95% confidence interval for μ ranging from about 18 to 22.
The use of the 2 standard deviations rule is an approximation. What if we 
wanted a more exact estimate of the 95% confidence interval, or what if we 
wanted to construct other confidence intervals, such as a 99% confidence 
interval? How would we go about doing that? 


z Test Statistic and z Values


In order to derive a more general expression of the confidence interval, we
first have to express the sample average in terms of a standard normal
distribution. We can do this by subtracting the value of μ and dividing by the
standard error. The calculated value will then follow a standard normal
distribution; that is,

\[ z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1) \]



This value is called a z test statistic. We can then compare the z test
statistic to z values taken from a standard normal distribution. A z value,
usually written as \(z_p\), is the point z on a standard normal curve such that
for random variable Z, \(P(Z < z_p) = p\). For example, \(z_{0.95} = 1.645\)
because 95% of the area under the curve is to the left of 1.645. See Figure 6-1.


Figure 6-1
The z value (standard normal curve marked at \(z_{0.95} = 1.645\))


Figure 6-1 shows a one-sided z value, but for confidence intervals, we’re
more interested in a two-sided z value, where p is the probability of the
value falling in the center of the distribution and α (which equals 1 − p)
is the probability of its falling in one of the two tails. For a two-sided range
of size p, these z values are \(-z_{1-\alpha/2}\) and \(z_{1-\alpha/2}\). In other
words, for a random variable Z,
\(P(-z_{1-\alpha/2} < Z < z_{1-\alpha/2}) = 1 - \alpha = p\). If we want to find
the central 95% of the standard normal curve, p = 0.95, α = 0.05, and
\(z_{1-0.05/2} = z_{0.975} = 1.96\). This means that 95% of the values on a
standard normal curve lie between −1.96 and 1.96. See Figure 6-2.
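
The difference between the one-sided and two-sided z values can be made concrete with a few lines of code. This Python sketch uses statistics.NormalDist from the standard library (Python 3.8 or later); the function names z_one_sided and z_two_sided are illustrative assumptions.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1

def z_one_sided(p):
    """z value with probability p to its left."""
    return std_normal.inv_cdf(p)

def z_two_sided(p):
    """z value that bounds the central probability p (alpha = 1 - p)."""
    alpha = 1 - p
    return std_normal.inv_cdf(1 - alpha / 2)

print(round(z_one_sided(0.95), 3))  # 1.645
print(round(z_two_sided(0.95), 3))  # 1.96
```

The two-sided value is larger because the 5% left over is split between both tails, leaving only 2.5% above the upper z value.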
Figure 6-2
Two-sided 
z values 



We can use a two-sided z value to construct a general expression for the 
confidence interval. The more general expression is 


\[ P\left( \bar{x} - z_{1-\alpha/2}\,\frac{\sigma}{\sqrt{n}} < \mu < \bar{x} + z_{1-\alpha/2}\,\frac{\sigma}{\sqrt{n}} \right) = 1 - \alpha \]

Thus the upper and lower confidence limits for μ are
\(\bar{x} \pm z_{1-\alpha/2}\,\sigma/\sqrt{n}\). For example, if α = 0.05, then
\(z_{1-0.05/2} = 1.96\) and the 95% confidence limits are
\(\bar{x} \pm 1.96 \times \sigma/\sqrt{n}\), which is pretty close to our
rule-of-thumb estimate of ± 2 standard errors from the sample average.
Table 6-1 shows confidence intervals of various sizes using this approach.


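Table 6-1
Confidence Intervals

1 − α     \(z_{1-\alpha/2}\)     Confidence Band
0.800     1.282     \(\bar{x} \pm 1.282 \times \sigma/\sqrt{n}\)
0.900     1.645     \(\bar{x} \pm 1.645 \times \sigma/\sqrt{n}\)
0.950     1.960     \(\bar{x} \pm 1.960 \times \sigma/\sqrt{n}\)
0.990     2.576     \(\bar{x} \pm 2.576 \times \sigma/\sqrt{n}\)
0.999     3.290     \(\bar{x} \pm 3.290 \times \sigma/\sqrt{n}\)
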
For example, if you want to construct a confidence interval around the
sample average that will capture the value of μ 99.9% of the time, calculate
the sample average ± 3.3 times the standard error. Admittedly, this will
tend to be a very large interval.

EXCEL TIPS

• To calculate the value of \(z_{1-\alpha/2}\) with Excel, use the function
NORMSINV(x), where x = 1 − α/2.

• To find the probability associated with a z test statistic, use the
function NORMSDIST(z), where z is the z test statistic.


Calculating the Confidence Interval with Excel 


You can use Excel’s functions to calculate a confidence interval if you know
the standard deviation of the underlying probability distribution. For
example, suppose you are conducting a survey on the cost of a medical procedure
as part of research on health care reform. The cost of the procedure follows
the normal distribution, where σ = 1,000. After sampling 50 different
hospitals at random, you calculate the average cost to be $5,500. What is the
90% confidence interval for the value of μ, the mean cost of all hospitals?
(That is, how far above and below $5,500 must you go to say, “I’m 90%
confident that the mean cost of this procedure lies in this range”?)


To calculate the 90% confidence interval:

1. Start Excel and open a blank workbook.

2. Type Average in cell A1, Std. Error in cell B1, Alpha in cell C1,
Lower in cell D1, and Upper in E1.

3. Click cell A2 and type 5500 (the observed sample average).

4. Type =1000/sqrt(50) in cell B2. This is the standard error of the
sample average.

5. Type 10% in cell C2. This is the alpha value for your confidence
interval.

6. Type =A2-B2*NORMSINV(1-C2/2) in cell D2. Note that we use the
NORMSINV(0.95) function to return the z value from the standard
normal distribution.

7. Type =A2+B2*NORMSINV(1-C2/2) in cell E2. Figure 6-3 shows the
resulting 90% confidence interval.
Figure 6-3
Using Excel to calculate a confidence interval

Close your workbook. You do not have to save the changes.


Excel returns a 90% confidence band ranging from $5,267.38 to
$5,732.62. If you were trying to estimate the mean cost of this procedure for
all hospitals, you could state that you were 90% confident that the cost was
not less than $5,267.38 or more than $5,732.62.
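
If you would like to confirm the worksheet result outside Excel, the same calculation can be sketched in Python. The function name z_confidence_interval is a hypothetical helper, and NormalDist().inv_cdf plays the role that NORMSINV plays in the worksheet.

```python
import math
from statistics import NormalDist

def z_confidence_interval(sample_mean, sigma, n, alpha):
    """Confidence limits: sample mean +/- z(1 - alpha/2) * sigma / sqrt(n)."""
    std_error = sigma / math.sqrt(n)          # cell B2 in the worksheet
    z = NormalDist().inv_cdf(1 - alpha / 2)   # NORMSINV(1 - C2/2)
    return sample_mean - z * std_error, sample_mean + z * std_error

# Hospital cost example: average 5500, sigma 1000, n = 50, alpha = 10%
lower, upper = z_confidence_interval(5500, 1000, 50, 0.10)
print(f"{lower:.2f} to {upper:.2f}")  # 5267.38 to 5732.62
```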


Interpreting the Confidence Interval 

It’s important that you understand what is meant by statistical confidence.
When you calculated the confidence interval for the cost of the hospital
procedure, you were not stating that the probability of the mean cost falling
between $5,267.38 and $5,732.62 was .90. That would incorrectly imply that
the range you calculated and the mean cost are random variables. They’re
not. After drawing a specific sample and from that sample calculating the
confidence interval, we’re no longer working with random variables but
with actual observations. The mean cost is also some fixed (but unknown)
number and is not random. The term confidence refers to our confidence
in the procedure we used to calculate the range. The term 90% confident
means that we are confident that our procedure will capture the value of μ
90% of the times it is used.



CONCEPT TUTORIALS 

The Confidence Interval 


To get a visual picture of the confidence interval in action, you can use the
Explore workbook named Confidence Intervals to read about, and work
with, a confidence interval.


To use the Confidence Intervals workbook:

1. Open the Confidence Intervals workbook located in the Explore
folder. Enable any macros in the workbook.

2. Move through the sheets in this workbook, viewing the material on
z values and confidence intervals.

3. Click Explore the Confidence Interval from the Table of Contents
column. See Figure 6-4.


Figure 6-4
The Confidence Intervals workbook
(Callouts in the figure: true mean value; size of the confidence interval;
confidence intervals that don’t capture the true mean; drag to change the
size of the confidence interval.)


The worksheet in Figure 6-4 shows 100 simulated confidence intervals
taken from a normal distribution with μ = 50 and σ = 4. Each of the 100
samples contains 25 observations, so the standard error of each sample
average is 0.8.

If a confidence interval captures the true value of μ, it shows up on the
chart as a vertical green line. If a confidence interval does not include μ, it
shows up as a vertical red line. You would expect that 95 of the 100 samples
would include the value of μ in their confidence intervals. Because there
is some random variation, Figure 6-4 shows that only 94% of the sample
confidence intervals include the value of μ. Using this worksheet, you can
generate a new random sample or change the width of the confidence band.
Try this now by reducing the confidence interval from 95 to 75%.

To reduce the size of the confidence interval: 

1. Drag the vertical scroll bar up until the value 75.0% appears in the
highlighted Confidence Interval box. See Figure 6-5.

Figure 6-5
75% confidence intervals
(Callout in the figure: intervals capture the true mean.)



By reducing the confidence interval to 75%, you’ve reduced the width
of the confidence band, but at the cost of many more red lines appearing on
your chart. If you were relying upon this confidence interval to capture the
value of μ, you would run a great risk of making an error. Let’s go the other
way and increase the confidence interval.

To increase the size of the confidence interval:

1. Drag the vertical scroll bar down until the value 99.0% appears in
the Confidence Interval box.


All of the confidence intervals now capture the value of μ, but the size of
the confidence bands has greatly increased. As you can see, there is a
trade-off in using the confidence interval. Selecting too small a value could
result in missing the value of μ. Selecting a larger value will almost certainly
capture μ, but at the expense of having a range of values too broad to be
useful. Statisticians have generally favored the 95% confidence interval as a
compromise between these two positions.
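
The trade-off can also be explored with a small simulation in the spirit of the Confidence Intervals workbook. The Python sketch below is not the workbook's macro; it is a hypothetical re-creation that keeps the workbook's settings (μ = 50, σ = 4, 25 observations per sample) and assumes 2,000 simulated samples and a fixed random seed, both chosen here only for illustration.

```python
import math
import random
from statistics import NormalDist, fmean

def coverage(confidence, mu=50.0, sigma=4.0, n=25, trials=2000, seed=1):
    """Return (share of intervals that capture mu, interval width)."""
    rng = random.Random(seed)  # fixed seed so every level sees the same samples
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    half_width = z * sigma / math.sqrt(n)
    hits = 0
    for _ in range(trials):
        sample_mean = fmean(rng.gauss(mu, sigma) for _ in range(n))
        # The interval captures mu when the sample mean lies within half_width
        if abs(sample_mean - mu) <= half_width:
            hits += 1
    return hits / trials, 2 * half_width

for level in (0.75, 0.95, 0.99):
    share, width = coverage(level)
    print(f"{level:.0%} interval: width {width:.2f}, captured mu in {share:.1%} of samples")
```

With these settings you should see the interval width grow as the confidence level rises, while the share of intervals that miss μ shrinks, which is the same pattern the red and green lines show in the workbook.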

An important lesson to learn from this simulation is to not take the sample
average at face value. Confidence intervals help you quantify how precisely
the sample average estimates the value of μ. The next time you hear in the
news that a study has shown that a drug causes a mean decrease in blood
pressure or that polls predict a certain election result, you should ask, “And
what is the confidence interval?”

If you want to generate a new set of random samples, you can click the
Generate Random Samples button in the Confidence Intervals workbook.
Continue exploring the workbook until you understand the relationship
among the confidence interval, the sample average, and the value of μ. Close
the workbook when you’re finished. You do not have to save your changes.


Hypothesis Testing 

Confidence intervals are one way of performing statistical inference; another
way is hypothesis testing. In a hypothesis test, you formulate a theory about the
phenomenon you’re studying and examine whether that explanation is
supported by the statistical evidence. In statistics, we formulate a hypothesis first,
then collect data, and then perform a statistical test. The order is important. If
we formulate our hypothesis after collecting the data, we run the risk of
having a biased test, because our hypothesis might be designed to fit the data. To
guard against a biased test, the hypothesis should be tested on a new set of data.
Figure 6-6 displays a classical approach to developing and testing a theory.


Figure 6-6 
The steps in 
developing 
and testing a 
hypothesis 
There are four elements in a hypothesis test:

1. A null hypothesis, \(H_0\)

2. An alternative hypothesis, \(H_a\)

3. A test statistic

4. A rejection region

The null hypothesis, usually labeled \(H_0\), represents the default or status
quo theory about the phenomenon that you’re studying. You accept the null
hypothesis as true unless you have convincing evidence to the contrary.
The alternative hypothesis, or \(H_a\), represents an alternative theory that is
automatically accepted as true if the null hypothesis is rejected. Often the
alternative hypothesis is the hypothesis you want to accept. For example, a
new medication is being studied that claims to reduce blood pressure. The
null hypothesis is that the medication does not affect the patient’s blood
pressure. The alternative hypothesis is that the medication does affect the
patient’s blood pressure (in either a positive or a negative direction).

The test statistic is a statistic calculated from the data that you use to
decide whether to reject or to accept the null hypothesis. The rejection region
specifies the set of values of the test statistic under which you’ll reject the
null hypothesis (and accept the alternative).


Types of Error 

We can never be sure that our conclusions are free from error, but we can try 
to reduce the probability of error. In hypothesis testing, we can make two 
types of errors: 

1. Type I error: Rejecting the null hypothesis when the null hypothesis 
is true 

2. Type II error: Failing to reject the null hypothesis when the alternative 
hypothesis is true 

The probability of Type I error is denoted by the Greek letter α, and the
probability of Type II error is identified by the Greek letter β.

Generally, statisticians are more concerned with the probability of Type
I error, because rejecting the null hypothesis often results in some
fundamental change in the status quo. In the blood pressure medication example,
incorrectly accepting the alternative hypothesis could result in prescribing an
ineffective drug to thousands of people. Statisticians will set a limit, called
the significance level, that is the highest probability of Type I error allowed.
An accepted value for the significance level is 0.05. This means we set up a
rejection region that has probability .05 if the null hypothesis is true, and we
reject the null hypothesis if the test statistic falls in this region.

Reducing Type II error becomes important in the design of experiments, 
where the statistician wants to ensure that the study will detect an effect if a 
true difference exists. An analysis of the probability of Type II error can aid 
the statistician in determining how many subjects to have in the study. 
An Example of Hypothesis Testing


Let’s put these abstract ideas into a concrete example. You work at a plant
that manufactures resistors. Previous studies have shown that the number of
defective resistors in a batch follows a normal distribution with a mean of 50
and a standard deviation of 15. A new process has been proposed that will
reduce the number of defective resistors, saving the plant money. You put
the process in place and create a sample of 25 batches. The average number
of defects in a batch is 45. Does this prove that the new process reduces the
number of defective resistors, or is it possible that the process makes no
difference at all, and the 45 is simply a random aberration?

Here are our hypotheses.

\(H_0\): There is no change in the mean number of defective resistors under
the new process.

\(H_a\): The mean number of defective resistors has changed.

Or, equivalently,

\(H_0\): The mean number of defective resistors in the new process is 50.

\(H_a\): The mean number of defective resistors is not 50.

Acceptance and Rejection Regions 

To decide between these two hypotheses, we assume that the null
hypothesis is true. Let \(\mu_0\) be the mean under the null hypothesis. This
means that under the null hypothesis,

\[ P\left( -z_{1-\alpha/2} < \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} < z_{1-\alpha/2} \right) = 1 - \alpha \]

Multiplying by the standard error and adding \(\mu_0\) to each term in the
inequality, we get

\[ P\left( \mu_0 - z_{1-\alpha/2}\,\frac{\sigma}{\sqrt{n}} < \bar{x} < \mu_0 + z_{1-\alpha/2}\,\frac{\sigma}{\sqrt{n}} \right) = 1 - \alpha \]

This means that the sample average should be in the range
\(\mu_0 \pm z_{1-\alpha/2}\,\sigma/\sqrt{n}\) with probability 1 − α, if the null
hypothesis is true. Now let α be our significance level, so that if the sample
average lies outside this range, we’ll reject the null hypothesis and accept
the alternative. These outside values would constitute the rejection region,
mentioned earlier. The values within the range constitute the acceptance
region, under which we’ll accept the null hypothesis. The upper and lower
boundaries of the acceptance region are known as critical values, because
they are critical in deciding whether to accept or to reject the null hypothesis.
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Let’s apply this formula to our example: fig = 50, cr = 15, n = 25, and we 
set a = 0.05 so that the prohahility of Type I error is 5%. The acceptance 
region is therefore 

= 50 ± 1.96 X 15/\/^ 

= 50 ± 5.88 
= (44.12,55.88) 

Any value that is less than 44.12 or greater than 55.88 will cause us to reject 
the null hypothesis. Because 45 falls in the acceptance region, we accept 
the null hypothesis and do not conclude that the new process decreases the 
number of defective resistors in a hatch. 
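
For readers who want to check this arithmetic outside Excel, the following is a minimal Python sketch of the acceptance-region calculation. It is not part of the book's Excel workflow; the function name acceptance_region and its parameter names are our own, and it assumes σ is known, as in the resistor example.

```python
from math import sqrt
from statistics import NormalDist

def acceptance_region(mu0, sigma, n, alpha=0.05):
    """Two-sided acceptance region for the sample mean when sigma is known."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 when alpha = 0.05
    margin = z_crit * sigma / sqrt(n)              # critical z times the standard error
    return mu0 - margin, mu0 + margin

# Resistor example: mu0 = 50, sigma = 15, n = 25, observed average = 45
lower, upper = acceptance_region(mu0=50, sigma=15, n=25)
sample_mean = 45
decision = "do not reject H0" if lower <= sample_mean <= upper else "reject H0"
print(f"acceptance region: ({lower:.2f}, {upper:.2f}) -> {decision}")
```

The region is centered on the null mean rather than on the sample average, which is why the observed value of 45 is compared against it directly.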


p Values 

The p value is the probability, assuming the null hypothesis is true, of a
value at least as extreme as the observed value. We can calculate that by
examining the z test statistic. For the manufacturing example, the z test
statistic is

$$z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} = \frac{45 - 50}{15/\sqrt{25}} = -1.67$$

The probability of a standard normal value of less than −1.67 is 0.0478.
To calculate the p value, we need to take into account the terms of the
alternative hypothesis. In this case, the alternative hypothesis was that the
new process made some difference (either positive or negative) in the number
of defects. Thus, we need to calculate the probability of an extreme value 1.67
units from 0 in either direction. Because the standard normal distribution
is symmetric, the probability of a value being < −1.67 is equal to the
probability of a value being > 1.67, so we can simply double the probability,
resulting in a p value of 2 × 0.0478 = 0.0956.

This was an example of a two-tailed test, in which we assume that extreme 
values can occur in either direction. We can also construct a one-tailed test, 
in which we consider differences in only one direction. A one-tailed test 
could have these hypotheses. 

H₀: The mean number of defective resistors in the new process is 50.

Hₐ: The mean number of defective resistors is < 50.

We use a one-tailed test if something in the new process would absolutely
rule out the possibility of an increase in the number of defective resistors.
If that were the case, we would not need to double the probability, and the
p value would be 0.0478, so the sample average lies outside the acceptance
region if α = 0.05. We would call this result statistically significant and
would reject the null hypothesis, accepting the hypothesis that the new
process reduces the number of defective resistors.

It sounds like we’ve got something for nothing, but we haven’t. We’ve
attained significant results at the cost of assuming something that we hadn’t
assumed before. Because it’s easier to achieve “significant” results in
one-tailed tests, they should be used with extreme caution and only when
warranted by the situation. You should always state your alternative hypothesis
before doing your analysis (rather than deciding on a one-tailed test after
seeing the results with the two-tailed test).
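
The one-tailed and two-tailed p values for the resistor example can be reproduced with a short Python sketch. This is an illustration only, not the book's method; the helper z_test_p_value is a name we made up, and its one-tailed branch assumes the alternative hypothesis points in the same direction as the observed z.

```python
from math import sqrt
from statistics import NormalDist

def z_test_p_value(sample_mean, mu0, sigma, n, tails=2):
    """Return (z, p) for a z test with known sigma.

    tails=1 assumes the alternative points in the direction of the observed z.
    """
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    one_tail = NormalDist().cdf(-abs(z))   # area beyond |z| in a single tail
    return z, tails * one_tail

z, p_two = z_test_p_value(45, 50, 15, 25, tails=2)
_, p_one = z_test_p_value(45, 50, 15, 25, tails=1)
print(f"z = {z:.2f}, one-tailed p = {p_one:.4f}, two-tailed p = {p_two:.4f}")
```

With α = 0.05 the two-tailed p value stays above the threshold while the one-tailed p value falls just below it, which is the contrast the text warns about.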

EXCEL TIPS

• To calculate the p value with Excel, first calculate the value of
the z test statistic using the Excel formula
z = (AVERAGE(data range) − μ₀)/(σ/SQRT(n)), where data range is the
range of cells in your worksheet containing the sample values, μ₀ is
the mean under the null hypothesis, σ is the standard deviation of
the probability distribution, and n is the sample size.

• For a one-tailed test where z is negative, the p value =
NORMSDIST(z).

• For a one-tailed test where z is positive, the p value =
1 − NORMSDIST(z).

• For a two-tailed test where z is negative, the p value =
2 × NORMSDIST(z).

• For a two-tailed test where z is positive, the p value =
2 × (1 − NORMSDIST(z)).



CONCEPT TUTORIALS 

Hypothesis Testing 


You can get a visual picture of the principles of hypothesis testing by
opening the Hypothesis Testing workbook.

To use the Hypothesis Testing workbook:

1 Open the Hypothesis Testing workbook, located in the Explore
folder. Enable any macros in the workbook.

2 Move through the workbook, reviewing the material on hypothesis
testing.

3 Click Explore Hypothesis Testing from the Table of Contents column.
See Figure 6-7.


Figure 6-7 The Hypothesis Testing workbook
(callout: click highlighted boxes to change conditions)

The workbook shows the sampling distribution under the null hypothesis,
where it is assumed that μ = 50. The rejection region is displayed on
the chart in black. This workbook allows you to vary four values (σ, x̄, the
sample size, and α) to see their effect on the hypothesis test. You can also
choose whether to perform a one-tailed or a two-tailed test. The results of
your choices are automatically displayed on the workbook and in the chart.
By working with different values of these factors, you can get a clearer
picture of how hypothesis testing works.

For example, what impact would doubling the sample size have on the
hypothesis test, assuming all other factors remained the same? Let’s find out.

To increase the sample size:

1 Click the Sample Size box, and change the sample size from 25 to 50.
See Figure 6-8.

Figure 6-8 Changing the sample size from 25 to 50
(callouts: you reject the null hypothesis with the increased sample size;
increasing the sample size decreases the spread of the distribution around
the sample average; sample size = 50)

The width of the acceptance region shrinks from 11.76 (44.12 to 55.88)
with a sample size of 25 to 8.32 (45.84 to 54.16) with a sample size of 50.
The observed sample average lies within the rejection region, so you reject
the null hypothesis with a p value of 1.84%.

Now let’s see what happens when you increase the value of σ from 15 to 20.

To increase the value of σ:

1 Click the Population Sigma box, and change the value from 15 to 20.

Because the value of σ has increased, the value of the standard error
has increased too, from 2.121 to 2.828. The lower critical value has
fallen to 44.456 and the p value has increased to 7.71%, so you do
not reject the null hypothesis. Variability is one of the most important
factors in hypothesis testing; much of statistical analysis is concerned
with reducing or explaining variability.

Finally, let’s find out what our conclusions would be if we used a
one-tailed test, where Hₐ is the hypothesis that the mean < 50.

To change to a one-tailed test:

1 Click the Ha: Mean < 50 (1-tailed) option button.

The chart changes to a one-tailed test. See Figure 6-9.

Figure 6-9 Switching from a two-tailed to a one-tailed test


Under a one-tailed test, we reject the null hypothesis. Of course, we have
to be very careful with this approach, because we are changing our
hypothesis after seeing the data. If this were an actual situation, changing
the hypothesis like this would be inappropriate. The better course would be
to draw another random sample of 25 batches and test the new hypothesis on
that set of data (and only if we have compelling reasons for doing a
one-tailed test).

Try other combinations of hypothesis and parameter values to see how 
they affect the hypothesis test. Close the workbook when you’re finished. 
You do not have to save your changes. 
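
If you do not have the workbook at hand, the experiments in this tutorial can be approximated with a short Python sketch. This is our own illustration rather than the workbook's implementation: z_test_summary is a hypothetical helper, and its one-tailed branch assumes the alternative "mean < μ₀" used in the tutorial.

```python
from math import sqrt
from statistics import NormalDist

def z_test_summary(sample_mean, mu0, sigma, n, alpha=0.05, two_tailed=True):
    """Standard error, lower critical value, and p value for H0: mean = mu0.

    The one-tailed case assumes Ha: mean < mu0.
    """
    std = NormalDist()
    se = sigma / sqrt(n)
    z = (sample_mean - mu0) / se
    if two_tailed:
        lower_crit = mu0 - std.inv_cdf(1 - alpha / 2) * se
        p = 2 * std.cdf(-abs(z))
    else:
        lower_crit = mu0 - std.inv_cdf(1 - alpha) * se
        p = std.cdf(z)
    return se, lower_crit, p

scenarios = {
    "n=25, sigma=15, two-tailed": dict(sigma=15, n=25, two_tailed=True),
    "n=50, sigma=15, two-tailed": dict(sigma=15, n=50, two_tailed=True),
    "n=50, sigma=20, two-tailed": dict(sigma=20, n=50, two_tailed=True),
    "n=50, sigma=20, one-tailed": dict(sigma=20, n=50, two_tailed=False),
}
results = {}
for label, kwargs in scenarios.items():
    results[label] = z_test_summary(45, 50, **kwargs)
    se, lower_crit, p = results[label]
    print(f"{label}: SE={se:.3f}, lower critical={lower_crit:.3f}, p={p:.4f}")
```

Each row corresponds to one setting of the workbook controls, so you can see how the standard error, the critical value, and the p value move together as the sample size, σ, and the number of tails change.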


Additional Thoughts about Hypothesis Testing 

One important point you should keep in mind when hypothesis testing is 
that accepting the null hypothesis does not mean that the null hypothesis is 
true. Rather, you are stating that there is insufficient reason to reject it. The 
distinction is subtle but important. To state that accepting the null
hypothesis means that μ = 50 excludes the possibility that μ actually equals
49 or 49.9 or 49.99. But you didn’t test any of these possibilities. What you
did test was whether the data are incompatible with the assumption that μ = 50.
You found that in some cases, they are not compatible.

You’ve looked at two approaches to statistical inference: the confidence
interval and the hypothesis test. For a particular value of α, the width of the
confidence interval around the sample average is equal to the width of the
two-sided acceptance region around μ₀. This means that the following two
statements imply each other:

1. The value μ₀ lies outside the (1 − α) confidence interval around x̄.

2. Reject the null hypothesis that μ = μ₀ at the α significance level.
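
The equivalence of these two statements can be checked numerically. The sketch below is only an illustration under the known-σ setting of the resistor example; the function names rejects_h0 and ci_excludes and the grid of trial sample averages are our own choices.

```python
from math import sqrt
from statistics import NormalDist

def rejects_h0(sample_mean, mu0, sigma, n, alpha=0.05):
    """True if the sample mean falls outside the acceptance region around mu0."""
    margin = NormalDist().inv_cdf(1 - alpha / 2) * sigma / sqrt(n)
    return abs(sample_mean - mu0) > margin

def ci_excludes(sample_mean, mu0, sigma, n, alpha=0.05):
    """True if mu0 falls outside the (1 - alpha) confidence interval around the sample mean."""
    margin = NormalDist().inv_cdf(1 - alpha / 2) * sigma / sqrt(n)
    return not (sample_mean - margin <= mu0 <= sample_mean + margin)

# Hypothetical sample averages 40.0, 40.5, ..., 60.0 with mu0 = 50, sigma = 15, n = 25
trial_means = [40 + 0.5 * k for k in range(41)]
agreement = all(
    rejects_h0(x_bar, 50, 15, 25) == ci_excludes(x_bar, 50, 15, 25)
    for x_bar in trial_means
)
print(agreement)
```

Both functions use the same margin, so the interval around x̄ misses μ₀ exactly when x̄ lies outside the acceptance region around μ₀.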


The t Distribution 

Up to now, you’ve been assuming that the value of σ is known. What if
you didn’t know the value of σ? One solution is to substitute the standard
deviation of the sample values, s, for σ in the hypothesis-testing equations.
However, there are problems with this approach. If s underestimates σ, then
you’ll overestimate the significance of the results, perhaps causing you to
reject the null hypothesis falsely. Or if s overestimates the value of σ, you
could accept the null hypothesis when the null hypothesis isn’t true.

Early in the twentieth century, William Gosset, working at the Guinness
brewery in Ireland, became worried about the uncertainty caused by
substituting s for σ. He believed that the resulting error could be especially
bad for small sample sizes. What Gosset discovered was that when you
substitute s for σ, the ratio

$$\frac{\bar{x} - \mu}{s/\sqrt{n}}$$

does not follow the standard normal distribution; rather, it follows a
distribution called the t distribution.

The t distribution is a probability distribution centered around zero and
characterized by a single parameter called the degrees of freedom, which is
equal to the sample size, n, minus 1. For example, if the sample size is 20,
the degrees of freedom equal 19. The t distribution is similar to a standard
normal distribution except that it has heavier tails. As the sample size
increases, the t distribution approaches the standard normal, but for smaller
sample sizes there can be big differences.

CONCEPT TUTORIALS

The t Distribution 


To see how the t distribution differs from the standard normal, open the
Distributions workbook.

To explore the t distribution:

1 Open the Distributions workbook, located in the Explore folder.

2 Click t from the Table of Contents column. Review the material and
scroll to the bottom of the worksheet.

3 Click the Spin Arrow buttons located next to the Degrees of freedom
box to change the degrees of freedom for the t distribution.
See Figure 6-10.

Figure 6-10 The t distribution
(callout: as the degrees of freedom increase, the t distribution more
closely approximates the normal distribution)

The workbook opens, displaying the t distribution with 1 degree of freedom.
Notice that the t distribution has heavier tails than the superimposed standard
normal distribution. As you increase the degrees of freedom, the t distribution
more closely approximates the standard normal.
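
The heavier tails can also be seen numerically. The following Python sketch compares the upper-tail probability beyond 2 for several degrees of freedom with the standard normal value. It is an illustration only: the helpers t_pdf and t_upper_tail are our own, and the tail area is obtained by integrating the t density with Simpson's rule rather than by any Excel function.

```python
from math import exp, lgamma, log, pi
from statistics import NormalDist

def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    log_const = lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * log(df * pi)
    return exp(log_const - (df + 1) / 2 * log(1 + x * x / df))

def t_upper_tail(t, df, steps=2000):
    """P(T > t) for t >= 0, using Simpson's rule on [0, t] (steps must be even)."""
    h = t / steps
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

normal_tail = 1 - NormalDist().cdf(2)
tails = {df: t_upper_tail(2, df) for df in (1, 5, 24, 1000)}
for df, prob in tails.items():
    print(f"df={df:5d}: P(T > 2) = {prob:.4f}")
print(f"normal  : P(Z > 2) = {normal_tail:.4f}")
```

The tail probability shrinks toward the normal value as the degrees of freedom grow, which is the behavior the spin arrows show on the chart.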

Continue exploring the t distribution by changing the degrees of freedom
and observing the changing curve. Close the workbook when you’re finished.
You do not have to save your changes.

Working with the t Statistic


Excel provides several functions to work with the t distribution. Two of
these are displayed in Table 6-2.

Table 6-2 Two t distribution functions in Excel

Function              Description

TDIST(t, df, tails)   Returns the p value for the t distribution for a given
                      value of t, degrees of freedom df, and tails = 1 (one
                      tailed) or tails = 2 (two tailed)

TINV(p, df)           Returns the two-tailed t value from the t distribution
                      with degrees of freedom df for a p value = p. For a
                      one-tailed t value, replace p with 2 × p.

Let’s use Excel to apply the t distribution to a problem involving textbook 
prices. The college administration claims that students should not expect 
to spend more than an average of $500 each semester for books. A student 
associated with the school newspaper decides to investigate this claim and 
interviews 25 randomly selected students. The average spent by the 25
students is $520, and the standard deviation of these purchases is $50. Is this
significant evidence that the statement from the administration is wrong? 

First, let’s construct our hypotheses. 

H₀: The average cost (μ) of textbooks is $500.

Hₐ: The average cost of textbooks is not equal to $500.

Now we construct the t statistic $t_{n-1}$.

$$t_{n-1} = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{520 - 500}{50/\sqrt{25}} = \frac{20}{10} = 2$$

To test the null hypothesis with Excel: 

1 Open a new blank workbook. 

2 In cell A1, type =TDIST(2,24,2) and press Enter.

In this example, 2 is the value of the t statistic, 24 is the degrees of 
freedom, and we enter 2 because this is a two-tailed test. 


The TDIST function returns a p value of 0.05694, so we do not reject the
null hypothesis at the 5% level. Thus we conclude that there is not
sufficient evidence that the college administration is underestimating the
price of textbooks. If we had used the z test statistic rather than the t
statistic in this example, the p value would have been 0.0455, and we would
have erroneously rejected the null hypothesis.
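
To see where these two p values come from, here is a minimal Python sketch that mirrors the TDIST calculation. It is not how Excel computes the function: t_pdf and t_two_tailed_p are helpers we wrote for illustration, and the tail area is approximated by Simpson's rule.

```python
from math import exp, lgamma, log, pi, sqrt
from statistics import NormalDist

def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    log_const = lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * log(df * pi)
    return exp(log_const - (df + 1) / 2 * log(1 + x * x / df))

def t_two_tailed_p(t, df, steps=2000):
    """Two-tailed p value P(|T| > |t|), playing the role of TDIST(|t|, df, 2)."""
    t = abs(t)
    h = t / steps
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):                 # Simpson's rule on [0, t]
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 1 - 2 * total * h / 3

# Textbook example: sample mean 520, claimed mean 500, s = 50, n = 25
x_bar, mu0, s, n = 520, 500, 50, 25
t_stat = (x_bar - mu0) / (s / sqrt(n))
p_t = t_two_tailed_p(t_stat, n - 1)
p_z = 2 * (1 - NormalDist().cdf(abs(t_stat)))   # what a z test would report
print(f"t = {t_stat:.1f}, t-based p = {p_t:.5f}, z-based p = {p_z:.4f}")
```

The same statistic of 2 lands on opposite sides of the 5% threshold depending on which distribution is used, because the t distribution with 24 degrees of freedom puts more probability in its tails.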


Constructing a t Confidence Interval 

Still, we have a sample average that doesn’t completely match the
administration’s claim. Let’s construct a 95% confidence interval for the mean
value to see in what range of values the true mean might lie. Because we don’t
know the value of σ, we can’t use the confidence interval equation discussed
earlier; we’ll have to use one based on the t distribution. The expression for
the t confidence interval is

$$\left(\bar{x} - t_{1-\alpha/2,\,n-1}\,\frac{s}{\sqrt{n}},\ \ \bar{x} + t_{1-\alpha/2,\,n-1}\,\frac{s}{\sqrt{n}}\right)$$

Here, $t_{1-\alpha/2,\,n-1}$ is the point on the t distribution with n − 1
degrees of freedom, such that the probability of a t random variable being
less than it is 1 − α/2. To calculate this value in Excel, you use the TINV
function. However, in the TINV function, you enter the value of α, not
1 − α/2. For example, to calculate the value of $t_{1-\alpha/2,\,n-1}$, enter
the function TINV(α, n−1). Use this information to construct a 95%
confidence interval.

To construct a 95% confidence interval for the price of textbooks:

1 In cell A2, type =520-TINV(0.05,24)*50/SQRT(25) and press Tab.

2 In cell B2, type =520+TINV(0.05,24)*50/SQRT(25) and press Enter.

The 95% confidence interval is (499.36, 540.64). We do not expect the mean
price of textbooks to be much less than $500, nor should it be much greater
than $540. By comparison, the confidence interval based on the standard
normal distribution is (500.40, 539.60), so the confidence intervals are very
close in size. Notice that the t confidence interval includes 500, which means
that 500 is not ruled out by the data, in agreement with our hypothesis test.
This agreement is not a coincidence, as can be shown with a little algebra.
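
The t confidence interval can also be sketched in Python from the summary figures given for the textbook survey (sample mean $520, standard deviation $50, n = 25). This is an illustration under our own assumptions, not Excel's algorithm: t_critical stands in for TINV and finds the critical value by bisection on a numerically integrated tail area.

```python
from math import exp, lgamma, log, pi, sqrt

def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    log_const = lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * log(df * pi)
    return exp(log_const - (df + 1) / 2 * log(1 + x * x / df))

def t_two_tailed_p(t, df, steps=2000):
    """Two-tailed tail area P(|T| > t) for t >= 0, via Simpson's rule."""
    h = t / steps
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 1 - 2 * total * h / 3

def t_critical(alpha, df):
    """Two-tailed critical value, in the role of TINV(alpha, df).

    Bisection on [0, 100]; adequate for the moderate alpha and df used here.
    """
    lo, hi = 0.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_two_tailed_p(mid, df) > alpha:
            lo = mid      # tail area still too large, so move right
        else:
            hi = mid
    return (lo + hi) / 2

x_bar, s, n = 520, 50, 25
margin = t_critical(0.05, n - 1) * s / sqrt(n)
ci = (x_bar - margin, x_bar + margin)
print(f"95% t interval: ({ci[0]:.2f}, {ci[1]:.2f})")
```

Note that alpha, not 1 − alpha/2, is passed to t_critical, matching the TINV convention described above.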


The Robustness of t 

When you use the t distribution to analyze your data, you’re assuming that
the data follow a normal distribution. What are the consequences if this turns
out not to be the case? The t distribution has a property called robustness,
which means that even if the assumption of normality is moderately violated,
the p values returned by the t statistic will still be fairly accurate. As long as
the distribution of the data does not violate the assumption of normality in an
extreme way, you can use the t distribution with confidence.


Applying the t Test to Paired Data

The t distribution becomes useful in analyzing paired data, where
observations come in natural pairs and you wish to explore the difference
between the two members of each pair. For example, a doctor might measure the
effect of a drug by measuring the physiological state of patients before and
after administering the drug. Each patient in this study has two observations,
and the observations are paired with each other. To determine the drug’s
effectiveness, the doctor looks at the difference between the before and after
readings.

The Labor workbook contains data on the percentage of women in the
workforce in 1968 and 1972 taken from a sample of 19 cities. The workbook
contains the following variables:


Table 6-3 Data on percentage of women in the workforce

Range Name   Range    Description

City         A2:A20   The name of the city
Year_68      B2:B20   The percent of women in the workforce in 1968
Year_72      C2:C20   The percent of women in the workforce in 1972
Diff         D2:D20   The change in percentage from 1968 to 1972 for each city


To open the Labor workbook: 

1 Open the Labor workbook from the Chapter06 folder.

2 Save the file as Labor Analysis to the same folder. The workbook
appears as shown in Figure 6-11.

Figure 6-11 The Labor workbook



There are two observations from each city, and the observations constitute
paired data. You’ve been asked to determine whether this sample of
19 cities demonstrates a statistically significant increase in the percentage
of women in the workforce. Let μ be the mean change in the percentage of
women in the workforce. You have two hypotheses.

H₀: μ = 0 (There is no change in the percentage from 1968 to 1972.)

Hₐ: μ ≠ 0 (There is some change, but we’re not assuming the direction
of the change.)

You can use StatPlus to test these hypotheses and create a t confidence
interval for the change in percent from 1968 to 1972. Do this now by
analyzing the Diff variable to test whether the average difference is
significantly different from zero.

To test whether there is a significant difference:

1 Click One Sample Tests from the StatPlus menu and then click
1 Sample t test.

We use the one-sample t test because we are essentially looking at
one sample of data: the sample of paired differences.

2 Verify that the 1-sample t test option button is selected.

3 Click the Input button and then click the Use Range Names option
button. Select Diff from the list of range names and click the OK
button.

4 Click the Output button and select the New Worksheet option button.
Enter t test for the new worksheet name and click the OK button.
Figure 6-12 shows the completed dialog box.

Note that we could have also selected the Paired t test (two columns)
and then selected the Year_68 and Year_72 columns. Note also that
the dialog box will test a null hypothesis of μ = 0 versus the
alternative hypothesis of not equal to 0. You can change these values if
you wish to test other hypotheses.

5 Click the OK button to generate the output from the t test. See
Figure 6-13.

Figure 6-13 t test analysis of the labor data



On the basis of our analysis, there is an average increase in the percentage
of women in the labor force of 3.37 percentage points between 1968 and
1972. This is statistically significant with a p value of 0.024, so we reject
the null hypothesis and accept the alternative. There has been a significant
change in women’s participation in the workforce in those four years. The
95% confidence interval for this estimate ranges from 0.49 percentage point
up to 6.25 percentage points. Notice the interval does not include 0 and the
hypothesis test rejects 0, so the interval and the test agree. It is not hard to
show that they will always agree.

An economist, viewing other data on this topic, claims that the percentage
of women in the workforce had actually increased by 5 points from 1968 to
1972. Do your data conflict with this statement? Let’s test the hypothesis.

H₀: μ = 0.05
Hₐ: μ ≠ 0.05

Rather than rerunning the StatPlus command, we can simply enter the
new hypothesis directly into the t test worksheet, if we have StatPlus set to
create dynamic output. Otherwise, we have to rerun the command.

To test the new hypothesis:

1 Click cell D2, change the value from 0 to 0.05, and press Enter.

The p value changes from 0.024 to 0.249. A p value of 0.249 is not small
enough to reject the null hypothesis. We conclude that our data do not conflict
with the claim made by this economist in a significant way.

You can also change other values in the hypothesis test. You can switch
to a one-sided test by changing the value of cell D3 to either -1 or 1. You
can also change the size of the confidence interval by changing the value in
cell D4.

EXCEL TIPS

• You can also perform a paired t test of your data using the Analysis
ToolPak, supplied by Excel. To perform a paired t test, load
the Analysis ToolPak and click the Data Analysis button from
the Analysis group on the Data tab. In the Data Analysis dialog
box, click t-Test: Paired Two Sample for Means and specify the
two columns containing the paired data. This command does not
calculate the confidence interval for you, so you have to calculate
that using the formulas supplied in this chapter.


You should not accept your analysis at face value without further
investigating the assumptions of the t test. One of these assumptions is that
the data follow a normal distribution. The t test is robust, but that doesn’t
mean you shouldn’t explore the possibility that the data are seriously
nonnormal. Do this now by creating a histogram and normal probability plot of
the data in the Diff column.

To create a histogram of the difference data: 

1 Click Single Variable Charts from the StatPlus menu and then click 

Histograms. 

2 Click the Data Values button and select Diff from the list of range 
names in the workbook. 

3 Click the Normal Curve checkbox to add a normal curve to the 
histogram. 

4 Click the Output button and save your histogram to the Histogram 
chart sheet. 

5 Click the OK button twice to create the histogram. See Figure 6-14.

Figure 6-14 Histogram of the difference data



The observed difference values don’t appear to follow the normal curve 
particularly well. Let’s see whether the normal probability plot provides 
more information. 

To create a normal probability plot of the difference data: 

1 Click Single Variable Charts from the StatPlus menu and then click 

Normal P-plots. 

2 Click the Data Values button and select Diff from the list of range 
names in the workbook. 

3 Click the Output button and save your chart to the P-Plot chart 
sheet. 

4 Click the OK button twice to create the normal probability plot. See
Figure 6-15.

From the two plots, there is enough graphical evidence to make us worry
that the data do not follow the normal distribution. This is a problem
because now we can’t feel completely comfortable about the p values the t test
gave us. Because the assumption of normality may have been violated, we
can’t be sure that the p value is accurate.


Applying a Nonparametric Test to Paired Data

A parametric test assumes a specific distribution such as the normal
distribution, and the t test is an example. A nonparametric test does not
assume a particular distribution for the data. Most nonparametric tests are
based on ranks and not the actual data values (this frees them from assuming a
particular distribution). The study of nonparametric statistics can fill an
entire textbook. We’ll just cover the high points and show how to apply a
nonparametric test to your data.

The Wilcoxon Signed Rank Test

The nonparametric counterpart to the t test is the Wilcoxon Signed Rank
test. In the Wilcoxon Signed Rank test, we rank the absolute values of the
original data from smallest to largest, and then each rank is multiplied by
the sign of the original value (−1, 0, or 1). In case of a tie, we assign an
average rank to the tied values. Table 6-4 shows the values of a variable,
along with the values of the signed ranks.

Table 6-4 Signed Ranks

Variable Values   Signed Ranks

     18                7.0
      4                2.0
     15                6.0
     -5               -3.5
     -2               -1.0
     10                5.0
      5                3.5

There are seven values in this data set, so the ranks go from 1 (for lowest
in absolute value) up to 7 (for the highest in absolute value). The lowest
in absolute value is −2, so that observation gets the rank 1 and then is
multiplied by the sign of the observation to get the signed rank value −1. The
value 4 gets the signed rank value 2 and so forth. Two observations, −5 and
5, are tied with the same absolute value. They should get ranks 3 and 4 in
our data set, but because they’re tied, they both get an average rank of 3.5
(or −3.5).
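
The ranking procedure just described can be written out as a short Python sketch. The function signed_ranks is our own illustrative helper (it is not the StatPlus SIGNRANK function); it averages the ranks of tied absolute values and gives a signed rank of 0 to any value that is exactly 0.

```python
def signed_ranks(values):
    """Rank the absolute values (ties share the average rank) and reattach each sign."""
    order = sorted(range(len(values)), key=lambda i: abs(values[i]))
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the group while the next absolute value ties with the current one.
        while (end + 1 < len(order)
               and abs(values[order[end + 1]]) == abs(values[order[pos]])):
            end += 1
        average_rank = (pos + end) / 2 + 1   # positions are 0-based, ranks are 1-based
        for k in range(pos, end + 1):
            ranks[order[k]] = average_rank
        pos = end + 1
    return [((v > 0) - (v < 0)) * r for v, r in zip(values, ranks)]

values = [18, 4, 15, -5, -2, 10, 5]     # the variable values from Table 6-4
ranks = signed_ranks(values)
print(ranks)        # [7.0, 2.0, 6.0, -3.5, -1.0, 5.0, 3.5]
print(sum(ranks))   # 19.0
```

The output reproduces the Signed Ranks column of Table 6-4, including the shared rank of 3.5 for the tied values −5 and 5.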

Next we calculate the sum of the signed ranks. If most of the values were 
positive, this would be a large positive number. If most of the values were 
negative, this would be a large negative number. The sum of the signed ranks 
in our example equals 7 + 2 + 6 — 3.5 — 1 + 5 + 3.5 = 19. 

The only assumption we make with the Wilcoxon Signed Rank test is 
that the distribution of the values is symmetric around the median. If under 
the null hypothesis we assume that the median = 0, this would imply that 
we should have as many negative ranks as positive ranks and that the sum 
of the signed ranks should be 0. Using probability theory, we can then
determine how probable it is for a collection of 7 observations to have a total
signed rank of 19 or more if the null hypothesis is true. Without going into 
the details of how to make this calculation, the p value in this particular 
case is 0.133, so we would not reject the null hypothesis. In addition to 
calculating p values, you can also calculate confidence intervals using the 
Wilcoxon test statistic. 
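
One way to carry out such a calculation, offered here only as a sketch, is to enumerate every possible pattern of plus and minus signs on the seven ranks, since all patterns are equally likely under the null hypothesis. This is our own illustration and not necessarily the method behind the p value quoted above, so the exact result it produces need not match that figure; different tie-handling conventions or large-sample approximations give somewhat different values.

```python
from itertools import product

signed = [7, 2, 6, -3.5, -1, 5, 3.5]    # signed ranks from Table 6-4
ranks = [abs(r) for r in signed]
observed = sum(signed)                  # 19.0

# Under H0 (symmetry about 0), each of the 2**7 sign patterns is equally likely.
sums = [sum(s * r for s, r in zip(signs, ranks))
        for signs in product((1, -1), repeat=len(ranks))]

p_upper = sum(total >= observed for total in sums) / len(sums)
p_two_sided = sum(abs(total) >= abs(observed) for total in sums) / len(sums)
print(p_upper, p_two_sided)   # 0.0625 0.125
```

Either way, the p value is well above 0.05, so the conclusion not to reject the null hypothesis is unchanged.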

One advantage in using ranks instead of the actual values is that the 
hypothesis test is much less sensitive to the effect of outliers. Also,
nonparametric procedures can be applied to situations involving ordinal data,
such as surveys in which subjects rank their preferences. The downside 
of nonparametric tests is that they are not as efficient as parametric tests 
when the data are normally distributed. This means that for normal data 
you need a larger sample size in order to detect statistically significant 
effects (5% larger when the Wilcoxon Signed Rank test is used in place 
of the t test). Of course, if the data are not normally distributed, you can 
often detect statistically significant effects with smaller sample sizes using 
nonparametric procedures. The nonparametric test can be more efficient
in those cases.

The bottom line is that if there is some question about whether to use a
parametric or a nonparametric test, do the analysis both ways.

Excel does not include any nonparametric tests, but you can use StatPlus 
to generate test results using the Wilcoxon Signed Rank test. Apply this test 
now to the work force data. Your hypotheses are 

H₀: Median population difference = 0
Hₐ: Median population difference ≠ 0

To analyze the difference data using the Wilcoxon Signed Rank test:

1 Click One Sample Tests from the StatPlus menu and then click
1 Sample Wilcoxon Sign Rank test.

2 Verify that the 1-sample W-test option button is selected.

3 Click the Input button, select Diff from the list of range names, and
click the OK button.

4 Click the Output button and select the New Worksheet option button.
Enter W-test for the new worksheet name and click the OK button.

5 Click the OK button to generate the output from the Wilcoxon Signed
Rank test. See Figure 6-16.

Figure 6-16 Wilcoxon signed rank analysis of the Diff data



The results of our analysis using the Wilcoxon Signed Rank test are similar
to the results with the t test. We still reject the null hypothesis, this time
with a smaller p value of 0.009. The 95% confidence interval is pretty close
to what we calculated before, (0.005, 0.060). Moreover, we learn that the
number of cities whose percentage of women in the workforce increased
from 1968 to 1972 was 13 and that only 2 cities showed a decrease in
percentage points (4 cities were unchanged). This information strengthens
our earlier conclusion that there was a significant increase in women in the
workforce during those four years.

EXCEL TIPS

• Excel doesn’t include a command to perform the signed rank
test, but you can approximate the p values and confidence
intervals of the signed rank test by calculating the signed ranks of your
data and then performing a one-sample t test on those values.

• To calculate the ranks of your data, use Excel’s RANK function.
If your data contain ties, you should use the RANKTIED function.
StatPlus required.

• To calculate the signed ranks of your data, use the SIGNRANK
function. StatPlus required.


The Sign Test 

Another nonparametric test that makes even fewer assumptions than the
Wilcoxon Signed Rank test is the Sign test. In the Sign test, we ignore
the values entirely and simply count the number of positive values and the
number of negative values. We then test whether there are more positive (or
negative) values than there should be. The test is similar to what we might
use to test whether a two-sided coin is fair.

The Sign test is usually less efficient (requiring a larger sample size) than
either the t test or the signed rank test, except in cases where the data come
from a heavy tailed distribution. In those cases, the Sign test may be more
effective than either the t test or the signed rank test.

Let’s apply the Sign test to our data set. Our hypotheses are

H₀: Probability of a negative value = probability of a positive value
Hₐ: Probability of a negative value ≠ probability of a positive value

To analyze the difference data using the Sign test:

1 Click One Sample Tests from the StatPlus menu and then click 1
Sample Sign test.

2 Verify that the 1-sample s test option button is selected.

3 Click the Input button, select Diff from the list of range names, and
click the OK button.

4 Click the Output button and select the New Worksheet option button.
Enter s test for the new worksheet name and click the OK button.

5 Click the OK button to generate the output from the Sign test. See
Figure 6-17.

Figure 6-17 Sign test analysis of the Diff data


Even under the Sign test, we reject the null hypothesis and accept the
alternative, concluding that the percentage of women in the workforce has
significantly increased. The p value for the Sign test is 0.007. We can also
construct a 95% confidence interval with the Sign test, but because of the
nature of the test, we can only approximate the confidence interval at this
level. The output in Figure 6-17 shows the approximate 95% confidence
interval of the change in the percentage of women in the workforce to be (0.00,
0.66). We can also find exact confidence intervals under the Sign test that
are closest to 95% without going over or going under 95%.
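
The Sign test p value can be checked with a few lines of Python, using the counts reported earlier for the 19 cities (13 increases, 2 decreases, and 4 unchanged). This is a sketch under our own assumptions: sign_test_p is a hypothetical helper, it drops the unchanged cities from the count, and it doubles the smaller binomial tail to obtain a two-sided p value. StatPlus may organize the calculation differently.

```python
from math import comb

def sign_test_p(n_pos, n_neg):
    """Two-sided exact sign test p value; zero differences are left out of the count."""
    n = n_pos + n_neg
    k = min(n_pos, n_neg)
    # P(X <= k) for X ~ Binomial(n, 1/2), the chance of so few signs of one kind
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

p = sign_test_p(n_pos=13, n_neg=2)
print(round(p, 3))   # 0.007
```

This is the same calculation you would use to judge whether 2 heads in 15 tosses is consistent with a fair coin.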

To find the exact confidence intervals under the Sign test: 

1 Click cell D5 and type -1. 

The output changes to give you the exact confidence interval that is 
at most 95%. In this case that is a 93.6% confidence interval, and it 
ranges from 0.00 to 0.060. Note that you cannot change the output if 
you are creating static output. 

2 Click cell D5 and type 1. 

The output changes and displays the exact confidence interval that is at 
least 95%. Excel displays the 98.1% confidence interval: (0.00, 0.80). 
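
The two confidence levels reported above can be traced back to the binomial distribution that underlies the Sign test. The following Python sketch is only an illustration and is not StatPlus's implementation. It assumes that all 19 differences are used and that the interval runs from the d-th smallest to the d-th largest difference, so that its exact coverage is 1 - 2 P(X <= d - 1) for X ~ Binomial(19, 0.5). The function name is hypothetical.

```python
from math import comb

def sign_interval_coverage(n, d):
    """Exact coverage of the interval from the d-th smallest to the
    d-th largest of n observations, under the Sign test's binomial model."""
    # P(X <= d - 1) for X ~ Binomial(n, 0.5)
    lower_tail = sum(comb(n, k) for k in range(d)) / 2 ** n
    return 1 - 2 * lower_tail

n = 19  # number of cities (assumption: no zero differences are dropped)
for d in (6, 5):
    print(d, round(sign_interval_coverage(n, d), 3))
# prints "6 0.936" and then "5 0.981"
```

Because the binomial distribution is discrete, no choice of d gives exactly 95% coverage here, which is why the output can only bracket that level with 93.6% and 98.1%.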


You’ve completed your research with the Labor Analysis workbook. 
You’ve found that there is sufficient evidence in this sample of 19 cities to 
conclude that there has been an increase in the percentage of women 
participating in the workforce during the four-year period from 1968 to 1972. 

To complete your work: 

1 Save your changes to the Labor Analysis workbook and close the file. 


The Two-Sample t Test 

In the one-sample or paired t test, you compared the sample average to a 
fixed value specified in the null hypothesis. In a two-sample t test, you 
compare the averages from two independent samples to determine whether a 
significant difference exists between the samples. For example, one sample 
might contain the cholesterol levels of patients taking a standard drug, while 
the second sample contains cholesterol data on patients taking an 
experimental drug. You would test to see whether there is a statistically 
significant difference between the two sample averages. 

To compare the sample averages from normally distributed data, you have 
a choice of two t tests. One test statistic, called the unpooled two-sample 
t statistic, has the form 


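$$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$


where $\bar{x}_1$ and $\bar{x}_2$ are the sample averages for the first and second samples, 
$s_1$ and $s_2$ are the sample standard deviations, $n_1$ and $n_2$ are the sample sizes, 
and $\mu_1$ and $\mu_2$ are the means of the two distributions. 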

This form of the t statistic allows for samples that come from 
distributions with different standard deviations, having values of $\sigma_1$ and 
$\sigma_2$. On the other hand, it may be the case that both distributions share a 
common standard deviation $\sigma$. If that is the case, we can construct a 
t statistic by pooling the estimates of the standard deviation from the two 
samples into a single estimate, which we’ll label as s. The value of s is 


$$s = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$$


The pooled two-sample t statistic would then be equal to 


$$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{s\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$$


Comparing the Pooled and Unpooled Test Statistics 


There are important differences between the two test statistics. The unpooled 
statistic, although we refer to it as a t test, does not strictly follow a 
t distribution. However, we can closely approximate the correct p values for 
this statistic by assuming it does and then comparing the test statistic to a 
t distribution with degrees of freedom equal to 

$$df = \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{(s_1^2/n_1)^2}{n_1 - 1} + \dfrac{(s_2^2/n_2)^2}{n_2 - 1}}$$

Here $s_1$ and $s_2$ are the standard deviations of the values in the first and 
second samples, and $n_1$ and $n_2$ are the sample sizes. The degrees of freedom 
for this statistic generally result in a fractional value. In actual practice, 
you’ll probably never have to make this calculation yourself; your statistics 
package will do it for you. 
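
As a rough illustration of why these degrees of freedom come out fractional, the following Python sketch evaluates the approximation in its usual Welch-Satterthwaite form. The function name and the summary statistics (standard deviations of 8 and 6 with 25 observations per sample) are illustrative assumptions, not StatPlus output.

```python
def welch_df(s1, n1, s2, n2):
    """Approximate degrees of freedom for the unpooled two-sample t statistic."""
    v1 = s1 ** 2 / n1  # estimated variance of the first sample average
    v2 = s2 ** 2 / n2  # estimated variance of the second sample average
    return (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))

df = welch_df(8, 25, 6, 25)
print(round(df, 2))  # 44.51, a fractional value
```

With these inputs the result falls below the 48 degrees of freedom that a pooled test would use for the same two sample sizes.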

For the pooled statistic, the situation is much easier. The pooled 
t statistic does follow a t distribution with degrees of freedom equal to 

$$df = n_1 + n_2 - 2$$

If the standard deviations are different and you apply the pooled 
t statistic to the data, you run the risk of reporting an erroneous p value. 
To guard against this problem, it may be best to perform both a pooled and an 
unpooled test and then compare the results. If they agree, report the pooled t, 
because this test statistic is more widely known. Use the unpooled t if the 
two tests disagree. You should also examine the standard deviations of the 
two samples and determine whether they’re close in value. 


Working with the Two-Sample t Statistic 

To see how the two-sample t test works, let’s consider two groups of students: 
One group has learned to write using a standard teaching approach, and the 
other has learned using a new teaching method. There are 25 students in 
each group. At the end of the session, each student writes an essay that is 
graded on a 100-point scale. The average grade for the group 1 students is 
75 with a standard deviation of 8. The average for the group 2 students is 80 
with a standard deviation of 6. Could the difference in sample averages be 
attributed to differences between the teaching methods? We assume that the 
distribution of the data in both groups is normal. Our hypotheses are 

$$H_0\colon \mu_1 - \mu_2 = 0$$

$$H_a\colon \mu_1 - \mu_2 \neq 0$$








where $\mu_1$ is the assumed mean of the first distribution and $\mu_2$ is the mean 
of the second distribution. Notice that we are not making any assumptions 
about what the actual values of $\mu_1$ and $\mu_2$ are; we are interested only in the 
difference between them. Because the standard deviations of the two 
samples are close in value, we’ll use a pooled t test to test the null 
hypothesis. First, we must calculate the value of the pooled standard 
deviation, s: 


$$s = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}} = \sqrt{\frac{(25 - 1)8^2 + (25 - 1)6^2}{25 + 25 - 2}} = 7.07$$


Thus, the t statistic equals 


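$$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{s\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}} = \frac{(75 - 80) - 0}{7.07\sqrt{\dfrac{1}{25} + \dfrac{1}{25}}} = -2.5$$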


which should follow a t distribution with 48 degrees of freedom. Is this a 
significant value? 

To evaluate the t statistic: 

1 Open a blank workbook. In cell A1 type =TDIST(2.5,48,2) and press 
Enter. 

Note that we enter the value 2.5 rather than -2.5 because the TDIST 
function works with positive values. We enter the value 2 as the 
third parameter because this is a two-sided test. 
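
The same calculation can be sketched outside Excel. The Python below recomputes the pooled standard deviation, the t statistic, and the two-sided p value from the summary statistics given in the text. It is a minimal sketch: Excel's TDIST is replaced by a simple Simpson's-rule integration of the t density, and the helper names t_pdf and two_sided_p are hypothetical.

```python
from math import exp, lgamma, log, pi, sqrt

def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    log_const = lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * log(df * pi)
    return exp(log_const - (df + 1) / 2 * log(1 + x * x / df))

def two_sided_p(t, df, steps=2000):
    """Two-sided p value, integrating the density over [0, |t|] with
    Simpson's rule (steps must be even)."""
    b = abs(t)
    h = b / steps
    total = t_pdf(0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = total * h / 3  # P(0 <= T <= |t|)
    return 1 - 2 * central_area

n1, mean1, sd1 = 25, 75, 8  # group 1: standard teaching approach
n2, mean2, sd2 = 25, 80, 6  # group 2: new teaching method

s = sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2))
t = ((mean1 - mean2) - 0) / (s * sqrt(1 / n1 + 1 / n2))
df = n1 + n2 - 2
p = two_sided_p(t, df)
print(round(s, 2), round(t, 2), df, round(p, 3))
```

Taking the absolute value of t inside two_sided_p plays the same role as entering 2.5 rather than -2.5 in TDIST.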


Excel returns a p value of 0.016, and we reject the null hypothesis. We 
conclude that the evidence supports the claim that students learn to write 
better using the new teaching method. 


Testing for Equality of Variance 

Part of the challenge in performing a two-sample t test is determining 
whether to use a pooled or unpooled estimate of the variance. Having equal 
variances across different samples is called homogeneity of variance. Three 
ways to test for equal population variances are the F test, Bartlett’s test, 
and Levene’s test. 

The F test is used for comparing only two sample variances. The test 
statistic is 

$$F = \frac{\text{larger variance}}{\text{smaller variance}}$$

where F follows the F distribution with $(n_1 - 1, n_2 - 1)$ degrees of freedom. 
Here $n_1$ is the number of observations in the sample with the larger 
variance, and $n_2$ is the size of the sample with the smaller variance. You’ll 
learn more about the F distribution in Chapter 9. The F test requires normal 
data and it performs badly if the data deviate from normality. 
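
To make the ordering of the degrees of freedom concrete, here is a small Python sketch of the F ratio. The function name and the input values (sample standard deviations of 6 and 8 with 25 observations each) are illustrative assumptions. The sketch stops at the statistic and does not compute the p value.

```python
def f_statistic(var_a, n_a, var_b, n_b):
    """Return the F ratio (larger variance over smaller) and its
    (numerator, denominator) degrees of freedom."""
    if var_a >= var_b:
        return var_a / var_b, (n_a - 1, n_b - 1)
    return var_b / var_a, (n_b - 1, n_a - 1)

# Illustrative values: standard deviations of 6 and 8, 25 observations each
f, dfs = f_statistic(6 ** 2, 25, 8 ** 2, 25)
print(round(f, 3), dfs)  # 1.778 (24, 24)
```

The first degrees-of-freedom value always belongs to the sample with the larger variance, whichever order the samples are passed in.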

Bartlett’s test can be used to test for homogeneity of variance from more 
than two samples. Bartlett’s test is sensitive to the effect of nonnormal data. 
If the sample data come from a nonnormal distribution, then Bartlett’s test 
may simply be testing for nonnormality rather than homogeneity of 
variance. Because of its computational complexity, we will not discuss how to 
calculate the Bartlett test statistic here. 

Levene’s test can also be used to test for homogeneity of variance from more 
than two samples. The Levene test statistic is less sensitive to departures from 
normality than the Bartlett test statistic. On the other hand, if the sample data 
do follow a normal distribution, then Bartlett’s test has better performance. 

Generally it’s a good idea to use all three tests (if possible) with your data. 
If none of the tests reject the hypothesis of variance homogeneity, you can 
probably use a pooled variance estimate in your analysis. If one or more of 
the tests reject this hypothesis, you should definitely consider performing 
an unpooled analysis. 

To calculate the values of these test statistics, use the functions described 
in Table 6-5. 


Table 6-5 Homogeneity of Variances functions in Excel 

Function                          Description 

FTEST(array1, array2)             Returns the p value of the F test for the 
                                  values in array1 and array2. 

BARTLETT(column1, column2, ...)   Returns the p value of the Bartlett test 
                                  statistic for values in column1, column2, 
                                  and so forth. StatPlus required. 

BARTLETT2(values, category)       Returns the p value of the Bartlett test 
                                  statistic for values in the values column. 
                                  Samples are indicated in the category 
                                  column. StatPlus required. 

LEVENE(column1, column2, ...)     Returns the p value of the Levene test 
                                  statistic for values in column1, column2, 
                                  and so forth. StatPlus required. 

LEVENE2(values, category)         Returns the p value of the Levene test 
                                  statistic for values in the values column. 
                                  Samples are indicated in the category 
                                  column. StatPlus required. 


The values of these test statistics will also be displayed in the output of 
StatPlus’s two-sample t test command. 


Applying the t Test to Two-Sample Data 

The Nursing Home workbook contains data from a random sample of 
nursing homes collected by the Department of Health and Social Services in 
New Mexico in 1988. The following variables were collected: 

Table 6-6 Nursing Home data 

Range Name     Range     Description 

Beds           A2:A53    The number of beds in the home 
Medical_Days   B2:B53    Annual medical inpatient days (hundreds) 
Total_Days     C2:C53    Annual total patient days (hundreds) 
Revenue        D2:D53    Annual total patient care revenue ($hundreds) 
Salaries       E2:E53    Annual nursing salaries ($hundreds) 
Expenses       F2:F53    Annual facility expenses ($hundreds) 
Location       G2:G53    Rural and nonrural homes 


To open the Nursing Home workbook: 

1 Open the Nursing Home workbook from the Chapter06 folder. 

2 Save the workbook as Nursing Home Analysis to the same folder. 
The workbook appears as shown in Figure 6-18. 


Figure 6-18 The Nursing Home workbook 


You’ve been asked to examine the data and determine whether there was 
a difference in revenue generated by rural and nonrural nursing homes 
during this time period. Here are your hypotheses. 

H0: Mean population revenue for rural nursing homes = 
Mean population revenue for nonrural homes 

Ha: Mean population revenue for rural nursing homes ≠ 
Mean population revenue for nonrural homes 

To test the null hypothesis you can use the StatPlus Two-Sample t test 
command. We initially assume that there is no homogeneity in variance between 
the two samples; however, we’ll reexamine this assumption as we go along. 

To perform a two-sample t test on the nursing home revenues: 

1 Click Two Sample Tests from the StatPlus menu and then click 
2-Sample t test. 

2 Verify that the Use column of category values option button is 
selected. 

Your data can be organized in one of two ways: (1) with a separate 
column for each sample or (2) with one column of data values and 
one column of category values. The nursing home data are organized 
in the second way, with the Location column indicating whether a 
particular nursing home is rural or nonrural. 

3 Click the Data Values button and select Revenue from the list of 
range names. 

4 Click the Categories button and select Location from the range 
names list. 

5 Click the Use Unpooled Variance option button. 

6 Click the Output button and direct the output to a new worksheet 
named t test. Set the output type to Dynamic. 

Figure 6-19 shows the completed dialog box. 


Figure 6-19 The Two-Sample t-Test dialog box 


7 Click the OK button to generate the two-sample t test. Figure 6-20 
shows the completed output. 


Figure 6-20 Results of the two-sample t test analysis with unpooled variance 


According to the output, the average revenue for nonrural homes is 
$16,821, whereas for rural homes it is $12,827. The data suggest that 
nonrural homes generate more revenue, but the p value for the t test is 0.057, 
which would cause us not to reject the null hypothesis. The 95% confidence 
interval for the difference in revenue ranges from -$118 to $8,106. 

However, the tests for equality of variance are all nonsignificant. The 
F test p value equals 0.698, the p value of Bartlett’s test equals 0.732, and 
the p value for Levene’s test equals 0.444. This would lead us to believe that 
we can use a pooled estimate for the population standard deviation. Let’s 
redo the test, this time with that assumption. 


To change the output to display the results of a pooled test: 

1 Click cell D5 in the t Test worksheet. 

Cell D5 contains the value TRUE if a pooled test is used and FALSE 
for an unpooled test. 

2 Replace FALSE with TRUE in cell D5. 

The output changes to display the pooled test results. See Figure 6-21. 


Figure 6-21 Results of the two-sample t test analysis with pooled variance 



Under the pooled test the p value changes to 0.048, leading us to reject the 
null hypothesis and accept the alternative: Nonrural homes generate more 
revenue than rural homes. The 95% confidence interval for the difference in 
revenue ranges from $29 to $7,958. Notice that the t test and the confidence 
interval agree, as they always do: the test rejects 0 as the difference between 
the population means, and the interval does not include 0. 

We have a problem. Depending on which test we use, we reach a different 
conclusion. The difference between the two tests depends on the 
assumption we make about the population standard deviation. Should we use the 
pooled or the unpooled results? To get a clearer picture, it would be a good 
idea to create a chart of the two distributions with a boxplot. 

To create a boxplot of the two samples: 

1 Click Single Variable Charts from the StatPlus menu and then click 
Boxplots. 

2 Verify that the Use a column of category levels option button is 
selected. 

3 Click the Data Values button and select Revenue from the list of 
range names. 

4 Click the Categories button and select the range name Location. 

5 Click the Output button and send the chart output to the chart sheet 
Revenue Boxplots. 

6 Click the OK button twice to generate the boxplots. See Figure 6-22. 


Figure 6-22 Boxplots of the revenue data 



Figure 6-22 indicates that there is a moderate outlier for revenues 
generated by rural nursing homes. This outlier may affect the results of our 
t tests. One way of dealing with this outlier is to switch to a nonparametric 
test, which is less influenced by the presence of outlying values. 


EXCEL TIPS 



• You can also perform a two-sample t test of your data using the 
Analysis ToolPak, supplied by Excel. To perform a two-sample 
t test, first load the Analysis ToolPak and then click the Data 
Analysis button from the Analysis group on the Data tab. 

• For a pooled test, click t Test: Two Sample Assuming Equal 
Variances and specify the two columns containing the sample data. 

• For an unpooled test, click t Test: Two Sample Assuming Unequal 
Variances. 

• The Analysis ToolPak does not include confidence intervals for 
the two-sample t. 


Applying a Nonparametric Test to Two-Sample Data 

The two-sample nonparametric test is the Mann-Whitney test. In the 
Mann-Whitney test we rank all of the values from smallest to largest and then 
sum the ranks in each sample. Unlike the Wilcoxon test, we do not rank the 
absolute data values or multiply the ranks by the sign of the original data. 
Table 6-7 shows an example of two-sample data along with the calculated ranks. 


Table 6-7 Two-Sample data 

Sample 1 Values   Ranks   Sample 2 Values   Ranks 

22                12.0    -3                3.0 
16                11.0    -1                4.0 
1                  5.0     2                6.0 
-4                 1.5     8                9.0 
7                  8.0    -4                1.5 
3                  7.0 
9                 10.0 


Note that we don’t need to have equal sample sizes. Our null hypothesis 
is that both samples have the same median value. In this example, the sum 
of the Sample 1 ranks is 54.5, and the sum of the Sample 2 ranks is 23.5. We 
can use probability theory to determine the probability of the first sample 
having a rank sum of 54.5 or greater if the null hypothesis were true. In this 
case, that p value would be 0.176, which would not support rejecting the 
null hypothesis. 
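
The ranking step can be reproduced with a short Python sketch using the values from Table 6-7. The helper name average_ranks is hypothetical. Tied values, such as the two -4 observations, share the average of the ranks they occupy. The sketch stops at the rank sums and does not compute the p value.

```python
def average_ranks(values):
    """Rank values from smallest to largest, giving tied values their average rank."""
    ordered = sorted(values)
    rank_of = {}
    for v in set(ordered):
        first = ordered.index(v) + 1          # rank of the first occurrence
        count = ordered.count(v)              # number of tied values
        rank_of[v] = first + (count - 1) / 2  # average of the tied ranks
    return [rank_of[v] for v in values]

sample1 = [22, 16, 1, -4, 7, 3, 9]
sample2 = [-3, -1, 2, 8, -4]

# Rank the pooled data, then split the ranks back into the two samples
ranks = average_ranks(sample1 + sample2)
rank_sum1 = sum(ranks[:len(sample1)])
rank_sum2 = sum(ranks[len(sample1):])
print(rank_sum1, rank_sum2)  # 54.5 23.5
```

The two rank sums add up to 78, the sum of the ranks 1 through 12, which is a quick check that no observation was missed.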

When using the Mann-Whitney test, we also need to calculate the median 
difference between the two samples. This is done by calculating the 
difference for each pair of observations taken from Sample 1 and Sample 2 and 
then determining the median of those differences. For the data in Table 6-7, 
there are 35 pairs, starting with the difference between 22 and -3 (the first 
observations in the samples) and going down to the difference between 9 
and -4 (the last observations). The median of these 35 differences is 7. By 
comparison, the difference of the sample averages is 7.31, so the median 
difference is pretty close. When the sample sizes get large, these calculations 
cannot be easily done by hand. 
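
For larger samples the pairwise differences are easier to generate by program. The following Python sketch, again using the Table 6-7 values, forms every Sample 1 minus Sample 2 difference and takes the median. It illustrates the definition only and is not StatPlus's implementation.

```python
from statistics import mean, median

sample1 = [22, 16, 1, -4, 7, 3, 9]
sample2 = [-3, -1, 2, 8, -4]

# One difference for every (Sample 1, Sample 2) pair: 7 x 5 = 35 in all
differences = [x - y for x in sample1 for y in sample2]

print(len(differences))                         # 35
print(median(differences))                      # 7
print(round(mean(sample1) - mean(sample2), 2))  # 7.31
```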

The Mann-Whitney test makes only four assumptions. 

1. Both samples are random samples taken from their respective probability 
distributions. 

2. The samples are independent of each other. 

3. The measurement scale is at least ordinal. 

4. The two distributions have the same shape. 

The Mann-Whitney test relies on ranks; this lessens the effect of outliers. 
The downside is that under certain situations, the Mann-Whitney test will 
not be as efficient as the two-sample t test in detecting differences between 
samples. Also, some researchers may not be familiar or comfortable with 
this nonparametric approach. 

Let’s apply the Mann-Whitney test to the nursing home data. Our 
hypotheses are 

H0: Median population revenue for rural nursing homes = 
Median population revenue for nonrural homes 

Ha: Median population revenue for rural nursing homes ≠ 
Median population revenue for nonrural homes 

To apply the Mann-Whitney test to the nursing home data: 

1 Return to the Nursing Home Data worksheet and click Two Sample 
Tests from the StatPlus menu and then click 2 Sample Mann-Whitney 
Rank test. 

2 Verify that the Use column of category values option button is 
selected. 

3 Click the Data Values button and select Revenue from the list of 
range names. 

4 Click the Categories button and select the range name Location. 

5 Click the Output button and send your output to a new worksheet 
named MW-test. 

6 Click OK to generate the Mann-Whitney analysis for the nursing 
home revenue. See Figure 6-23. 


Figure 6-23 Mann-Whitney analysis of the revenue data 


From the output in Figure 6-23, we note that the median revenue from 
nonrural nursing homes is $17,538, whereas the median revenue of rural 
homes is $10,921. The median difference is $4,463 (remember that the median 
difference is not the difference between the two medians). This difference is 
statistically significant with a p value of 0.026. Note that by using a test that 
reduces the influence of outliers, we obtain a smaller p value. 

Our final decision: We reject the null hypothesis that revenue generated 
by rural homes was equal to revenue generated by nonrural homes and 
accept the hypothesis that they were not equal. The 95% confidence interval 
gives a range of values for this difference. We conclude that the median 
difference is not less than $343 and not more than $9,196. 

To complete your work: 

1 Save your changes to the workbook and close the file. 


Final Thoughts about Statistical Inference 

The previous example displays some of the challenges and dangers in doing 
statistical inference. It is tempting to see a p value or a confidence interval 
as the authoritative answer to your research. However, to use the tools of 
statistical inference properly, you should always be aware of the limitations 
of your statistical tests. Here are some general rules you should follow when 
performing statistical inference. 

1. State your hypotheses clearly and, if possible, before collecting and 
analyzing your data. 

2. Understand the nature and limitations of the statistical tests you use. Be 
aware of any assumptions that the test makes about the nature of your 
data. Try to verify that these assumptions are met (or at least that there is 
no evidence that they are being violated). 

3. Graph your data; it will help you more easily detect any departures from 
the assumptions of your statistical test. Calculate descriptive statistics of 
your data for the same reason. 

4. If appropriate, perform more than one kind of statistical test. A different 
test, such as a nonparametric one, may provide important insight into 
your data. 

5. Your goal is not to reject the null hypothesis. A study that fails to reject 
the null hypothesis is not a failure, nor is a low p value a sign of success 
(especially if you’re rejecting the null hypothesis in error). Your goal 
should be to determine what, if any, conclusions you can reach about 
your data in a fair and impartial way and then to ascertain how reliable 
those conclusions are. 


Exercises 


1. True or false (and why): A 95% confidence 
interval that covers the range (-5, 5) tells 
you that the probability is 95% that μ will 
have a value between -5 and 5. 

2. True or false (and why): Accepting the 
null hypothesis means that the null 
hypothesis is true. 

3. True or false (and why): Rejecting the 
null hypothesis means that the null 
hypothesis is false. 

4. Consider a sample of 25 normally distributed 
observations with a sample average of 50. 

a. Calculate the 95% confidence interval 
if σ = 20. 

b. Calculate the 95% confidence interval 
if σ is unknown but the sample 
standard deviation = 20. 

5. The nationwide mean price for a three-year-old 
Honda Civic DX is $11,500 with 
a known standard deviation of $600. 
You check the newspaper and find 9 
three-year-old Civics in San Francisco 
selling for an average price of $12,000. 
You wonder whether the cost of Civics 
in San Francisco is higher than for the 
rest of the nation. 

a. State your question about the price of 
Civics in terms of a null and an alternative 
hypothesis. What are you assuming 
about the distribution of Civic prices? 

b. Will the alternative hypothesis be one 
or two sided? Defend your answer. 

c. Test your null hypothesis. Do you accept 
or reject it and at what p value? 
Construct a 95% confidence interval 
for Civic prices in San Francisco. 

d. Redo your analysis, but this time assume 
that the sample size is 10 with 
a sample average of $12,000 and a 
sample standard deviation of $600. 
Assume that you don’t know the 
value of the nationwide standard 
deviation. 

6. In tests of stereo speakers, ten American-made 
speakers had an average performance 
rating of 90 with a standard deviation 
of 5. Five imported speakers had an average 
rating of 85 with a standard deviation 
of 4. 

a. Write a null and an alternative 
hypothesis comparing the two types of 
speakers. 

b. Test the null hypothesis. What is the 
p value? 

c. If you decide to change the significance 
level to 10%, does your conclusion 
change? 

7. Derive the formula for the t confidence 
interval based on the definition of the 
t statistic shown earlier in this chapter. 

8. You want to continue to study the nursing 
home data discussed in this chapter. 
Explore whether there is a significant 
difference between rural and nonrural 
homes in terms of size (as expressed by 
the number of beds in the homes). 

a. Open the Nursing Home workbook 
from the Chapter06 folder. Save the 
workbook as Nursing Home Beds to 
the same folder. 

b. Write down a set of hypotheses for 
exploring the question of whether the 
numbers of beds in rural and nonrural 
homes differ. 

c. Apply a two-sample t test to the data. 
Report your results assuming a pooled 
estimate of the standard deviation and 
assuming an unpooled estimate. What 
are the p value and confidence interval 
under each assumption? 

d. Create a boxplot of the Beds variable 
for the different locations. Do the plot 
and the descriptive statistics give you 
any help in determining whether to 
use a pooled or a nonpooled test? 
Explain your answer. 

e. Apply the Mann-Whitney test to the 
data. What are your hypotheses? What 
is your conclusion? Include information 
on the p value and confidence 
interval in your discussion. 

f. Save your changes to the workbook 
and then write a report summarizing 
your results. Do your conclusions 
differ on the basis of which test you 
apply? Which statistical test would 
you report and why? 

9. Return to the nursing home data from 
this chapter. This time you’ve been 
asked to explore whether rural homes 
are used at a lower rate than nonrural 
homes after adjusting for the differing 
size of the homes. 

a. Open the Nursing Home workbook 
from the Chapter06 folder and save it 
as Nursing Home Usage Rates. 

b. Create a new variable named 
Days_Beds equal to the ratio of the 
total number of patient days to the 
number of beds in the home. Format 
the data to display three decimal 
places. 

c. Compare the average value of the 
Days_Beds variable for rural and 
nonrural homes. What are your 
hypotheses? What test or tests 
will you use to evaluate your null 
hypothesis? 

d. Create a boxplot of the Days_Beds 
variable for the two locations. 

e. Save your changes to the workbook 
and then write a report summarizing 
your results, including any descriptive 
statistics, p values, and confidence 
intervals you created during 
your analysis. Is there evidence to 
suggest that rural homes are being 
utilized at a lower rate? 

10. Draft numbers from the Vietnam War 
have been recorded for you. (See the 
Chapter 4 exercises for a discussion of 
the draft lottery.) It’s been claimed that 
people whose birthday fell in the second 
half of the year had lower draft numbers 
and therefore were more likely to be 
drafted. Explore this claim. 

a. Open the Draft workbook from the 
Chapter06 folder and save it as Draft 
Number Analysis. 

b. Write down the null and alternative 
hypotheses for your study. 

c. Create a two-sample t test to analyze 
your hypotheses. Do you use a pooled 
or an unpooled test? Which type of 
test does the distribution of the data 
support? 

d. Create a histogram of the distribution 
of draft numbers broken down by 
whether the number was assigned in 
the first half of the year or the second. 
What probability distribution does the 
data resemble? What property of the 
t statistic allows you still to apply 
the t test to your data? 

e. Calculate a 95% confidence interval 
for the average draft number for 
people born in the first half of the 
year and then for people born in the 
second half of the year. (Hint: You can 
do this using StatPlus’s 1-sample t test 
command, specifying the Half variable 
as the BY variable.) 

f. Save your workbook and write a report 
summarizing your results. What 
is the mean difference in draft number 
between people born in the first 
half of the year and those born in the 
second half? Is this a significant 
difference? What are the practical 
ramifications of your conclusions? 

11. The Junior College workbook contains 
salary information for faculty at a 
college. The female faculty members claim 
that they are underpaid relative to their 
male counterparts. Investigate their 
claim. 

a. Open the Junior College workbook 
from the Chapter06 folder and save it 
as Junior College Salary Analysis. 

b. Write down your null and alternative 
hypotheses. What is the significance 
level for this test? 

c. Perform a two-sample t test on the 
salary data broken down by gender. Does 
it make any difference whether you 
perform a pooled or an unpooled test? 
Do the data suggest that there is a 
salary difference between male and 
female faculty members? Create 
histograms of the distribution of salary 
data for male and female instructors. 

d. Redo the two-sample t test, this time 
breaking down the analysis of salary 
versus gender by the Rank_Hired 
variable. Are there significant gender 
differences in terms of salary for the 
various employee ranks? (Note: Some 
combinations of gender and rank hired 
will have sample sizes of 0. This will 
result in Excel displaying a #VALUE! 
result in the workbook. You can ignore 
these employee ranks because there 
are no values to investigate.) 

e. Save your workbook and summarize 
your conclusions. Is there evidence 
that the college has underpaid its 
female faculty? If so, does this difference 
exist for all teaching ranks? Why 
does this study not prove sexual 
discrimination? What factors have been 
ignored? 

12. The Big Ten workbook has graduation 
information on Big Ten schools. (See 
Chapter 3 for a discussion of this data 
set.) Explore whether there is a 
difference in the graduation rates between 
white male athletes and white 
female athletes. 

a. Open the Big Ten workbook from the 
Chapter06 folder and save it as Big 
Ten Graduation Analysis. 

b. State your null and alternative 
hypotheses. 

c. Perform a paired t test of the white 
male and white female athlete graduation 
rates. Is there statistically significant 
evidence of a difference in 
the graduation rates? What is a 95% 
confidence interval for the difference? 
What is the 90% confidence 
interval? 

d. Redo the analysis using the Wilcoxon 
Signed Rank test and the Sign test. 

e. Why is this an example of paired 
data? 

f. Save your workbook and write a report 
summarizing your conclusions. 
Can you apply your results to universities 
in general? Defend your answer. 

13. The Mortgage workbook contains information 
on refusal rates from 20 lending 
institutions broken down by race and 
income status from the late 1980s. It was 
claimed in reports to Congress that lending 
institutions had significantly higher 
refusal rates for minorities. Examine the 
statistical basis for that claim. 

a. Open the Mortgage workbook from 
the Chapter06 folder and save it as 
Mortgage Refusal Analysis. 

b. State your null and alternative 
hypotheses. 

c. Apply a paired t test to the refusal 
rates for minority and white applicants. 
What is the 95% confidence 
interval for the difference in refusal 
rates? What is the p value for the test? 

d. Create a histogram and normal probability 
plot of the difference in refusal 
rate. Do the data appear normal? 

e. Redo your analysis using the Wilcoxon 
Signed Rank test. How do 
your results compare to the paired 
t test? 

f. Redo questions b through e using the 
refusal rates for high-income whites 
and minorities. How do the results of 
the two analyses compare, especially 
in terms of the confidence interval 
for the difference in refusal rate? Is 
there evidence to suggest that there is 
no refusal rate gap for higher-income 
minorities? 

g. Save your changes to the workbook 
and write a report summarizing your 
conclusions. 

14. The Teacher.xls workbook stores average 
teacher salary, public school spending 
per pupil, and the ratio of teacher to 
pupil spending for 1985, broken down 
by state and region. Analyze the results 
stored in this workbook. 

a. Open the Teacher workbook from 
the Chapter06 folder and save it as 
Teacher Analysis. 

b. Construct a 95% t confidence interval 
for each of the numeric variables, 
broken down by area. 

c. Construct a 95% Wilcoxon Signed 
Rank confidence interval for the 
numeric variables, by area. 

d. Save the changes to your workbook 
and write a report summarizing the 
results of your analysis. 

15. The Pollution workbook contains data 
on the number of unhealthful pollution 
days for 14 U.S. cities comparing 1980 
values to average values from 2000 to 
2006. Analyze what impact environmen¬ 
tal regulations have had on pollution. 

a. Open the Pollution workbook from
the Chapter06 folder and save it as
Pollution Analysis.

b. State your null and alternative
hypotheses for this analysis.


c. Create histograms of the ratio and 
difference between the 1980 and 2006 
values. Does the distribution of those 
two variables appear to follow the 
normal distribution? 

d. Analyze the difference and ratio
values using a one-sample t test,
a Sign test, and a Wilcoxon Signed Rank
test.

e. Save your changes to the workbook. 
Write a report summarizing your 
observations. Where do you see sig¬ 
nificant changes in the number of 
pollution days? Is this true for all sta¬ 
tistical tests? Given the distribution of 
the data, which test appears to be the 
most appropriate for these values? 

16. In a NASA-funded study, seven men and 
eight women spent 24 days in seclusion 
to study the effects of gravity on circula¬ 
tion. Without gravity, there is a loss of 
blood from the legs to the upper part of 
the body. The study started with a nine- 
day control period in which the subjects 
were allowed to walk around. Then 
followed a ten-day bed-rest period in 
which the subjects’ feet were somewhat 
elevated to simulate weightlessness in 
space. The study ended with a five-day 
recovery period in which the subjects 
again were allowed to walk around. 
Every few days, the researchers mea¬ 
sured the electrical resistance at the calf, 
which increases when there is a blood 
loss. The electrical resistance gives an 
indirect measure of the blood loss and 
indicates how the subject’s body re¬ 
sponds to the conditions. You’ve been 
asked to examine whether the male sub¬ 
jects and the female subjects differed in 
how they responded to the study. You’re 
to perform your analysis for each of the 
days in the study. 

a. Open the Space workbook from the 
Chapter06 folder and save it as Space 
Biology Analysis. 
b. State your null and alternative
hypotheses. 

c. Perform a two-sample t test comparing
the value of the Resistance variable
between the male and female
subjects, broken down by day. You do
not have to summarize your results
across days.

d. On what day or days is there a significant
difference between the two
groups? Do your results change if
you use an unpooled rather than
a pooled estimate of the standard
deviation?

e. Create a scatterplot of Resistance versus
Days. Break the scatterplot down
by gender using the StatPlus command
shown in Chapter 3. Describe
the effect displayed in the scatter
plot (you may want to change the
scatter plot scales to view the data
better).

f. Redo your analysis using the Mann-Whitney
test. Do your conclusions
from part b change any with the nonparametric
procedure?

g. Save your workbook and write a
report summarizing your findings.
Explain how (if at all) the male and
female subjects differed in their response
to the study. Include in your
discussion the various parts of the
study (control period, bed-rest period,
etc.) and how the patients responded
during those specific intervals.
Include any pertinent statistics.

17. The Math workbook contains data
from a study analyzing two methods of
teaching mathematics. Students were
randomly assigned to two groups: a control
group that was taught in the usual
way with a relaxed homework and quiz
schedule, and an experimental group
that was regularly assigned homework
and given frequent quizzes. Students in
the experimental group were allowed to
retake their exams to raise their grades
(though a different exam was given for
the retake). The final exam scores of the
two groups were recorded. Investigate
whether there is compelling evidence
that students in the experimental group
had higher scores than those in the control
group.

a. Open the Math workbook from the
Chapter06 folder and save it as Math
Scores Analysis.

b. State your null and alternative hypotheses.
Is this a one-sided test or a
two-sided test? Why?

c. Perform a two-sample t test on the final
exam score. Use a pooled estimate of
the standard deviation. What is the
95% confidence interval for the difference
in scores? What is the p value?
Do you accept or reject the null hypothesis?
Do your conclusions change
if you use an unpooled test?

d. Chart the distribution of the final
exam scores for the two groups.
What do the charts tell about the
distributions? Do the charts cast any
doubt on your conclusions in
part c? Why?

e. Do a second analysis of the data using
the Mann-Whitney Rank test. How do
these results compare to the two-sample
t? Are your conclusions the same?

f. Save your changes to the workbook
and write a report summarizing your
findings and reporting your conclusion.
Is there a significant change in
the exam scores under the experimental
approach?

18. The Voting workbook contains the per¬ 
centage of the presidential vote that the 
Democratic candidate received in 1980 
and 1984, broken down by state and 
region. You’ve been asked to investigate
the difference between the 1980 and
1984 voting patterns.

a. Open the Voting workbook from the
Chapter06 folder and save it as Voting
Analysis.

b. Do a paired t test for the voting percentage,
broken down by region. Summarize
your findings for all regions as
well.

c. For which regions was there a signifi¬ 
cant change in the voting percentage? 
For which regions was there no sig¬ 
nificant change? What was the over¬ 
all change in the voting percentage 
across all regions? 

d. Save your changes to the workbook
and write a report summarizing your
findings, including your descriptive
statistics, p values, and confidence
intervals.

19. The Calculus workbook shows the first
semester calculus scores for male and
female students. Analyze the data set to
determine whether there is a significant
difference between the two groups.

a. Open the Calculus workbook from
the Chapter06 folder and save it as
Calculus Scores Analysis.

b. State your null and alternative 
hypotheses. 

c. Perform a two-sample t test on the calc 
values using a pooled estimate of the 
variance. What is the 95% confidence 
interval of the difference between the 
two groups? Do you reject or accept 
the null hypothesis? At what p value? 

d. Chart the distribution of the calc 
values for the two groups. Do the 
distributions appear normal? What 
property of exam scores makes it un¬ 
likely that these exam scores follow 
the normal distribution? (Hint: Test
scores are usually constrained to fall
between 0 and 100.) What property of
the t distribution might allow you to
use the t test anyway?

e. Repeat your analysis using the Mann- 
Whitney non-parametric test. 

f. Save your changes to the workbook 
and write a report summarizing your 
findings and stating your conclusions. 

20. The Reaction workbook contains information
on reaction times and race times,
recorded by sprinters running in the first 
three rounds of the 100-meter dash at the 
1996 Summer Olympics. You’re asked to 
determine whether there is evidence that 
the sprinter’s reaction time (the time it 
takes for the sprinter to leave the starting 
block at the sound of the gun) changes as 
he advances in the competition. 

a. Open the Reaction workbook from 
the Chapter06 folder and save it as
Reaction Time Analysis. 

b. Use the paired t test and analyze the 
differences between the following 
variables: React 1 vs. React 2, React 1 
vs. React 3, and React 2 vs. React 3. 
Calculate the 95% confidence inter¬ 
val for each difference pair, and test 
for statistical significance at the 5% 
level. Are there any pairs of rounds in 
which there is a significant difference 
in the average reaction time? 

c. Create three new columns in the 
Reaction Times worksheet displaying 
the three paired differences, and then 
create three normal probability plots 
of those differences. Does the distri¬ 
bution of the paired differences fol¬ 
low a normal distribution? 

d. Redo the analysis, this time using the 
Wilcoxon Signed Rank test. Do your
conclusions change when you use 
this test? 

e. Save your changes to the workbook 
and write a report summarizing your 
results and give your conclusions. 
Have you accepted the null hypothesis,
or is there evidence that reaction
times do change from one round to
another?

21. Return to your analysis of the results
of the 1996 100-meter dash. This time,
analyze the race times from the three
rounds of the race.

a. Open the Race workbook from the
Chapter06 folder and save it as
Race Results Analysis.

b. Perform a paired t test of the race
times (use the Race1, Race2, and
Race3 variables), comparing the
differences between Round 1 and
Round 2 and then Round 2 and
Round 3. Is there significant evidence
that the race times decrease as the
runner advances in the competition?
Calculate the 95% confidence interval
for the change in race time.

c. Save your changes to the workbook
and write a report summarizing your
results, including any descriptive
statistics and p values. Does your evidence
suggest any difference in the
competition level as a runner goes
from Round 1 to Round 2 as compared
to going from Round 2 to Round 3?

Chapter 7


Tables 

Objectives 

In this chapter you will learn to: 

• Create PivotTables of a single categorical variable
• Create PivotCharts as column and pie charts
• Relate two categorical variables with a two-way table
• Apply the chi-square test to a two-way table
• Compute expected values of a two-way table
• Combine or eliminate small categories to get valid tests
• Test for association between ordinal variables
• Create a custom sort order for your workbook

In this chapter you’ll learn how to work with categorical data in the
form of tables and ordinal variables. You’ll learn how to use Excel’s 
PivotTable feature to create tables, and you’ll explore how to analyze 
these data in the table using StatPlus. 


PivotTables 

In the previous chapter you used t tests and nonparametric tests to analyze 
continuous variables. You can also apply hypothesis tests to categorical and 
ordinal data. These types of data are most commonly seen in surveys, which
record counts broken down by categories. For example, you might create a table
of instructors broken down by title (assistant professor, associate professor, 
or full professor) and gender (male or female). Are there significantly more 
male full professors than female? How many female professors would you 
expect given the data? An analysis of categorical data addresses questions of 
this type. 

To illustrate how to work with categorical variables, let’s look at data from 
a survey of professors who teach statistics courses. The Survey workbook 
includes 392 responses to questions about whether the course requires cal¬ 
culus, whether statistical software is used, how the software is obtained by 
students, what kind of computer is used, and so on. The workbook contains 
the following variables shown in Table 7-1: 


Table 7-1 Survey of Statistics Professors Data

Range Name      Range      Description
Computer        A2:A393    Computer used in the course
Dept            B2:B393    Department
Available       C2:C393    Type of computer system available to the student
Interest        D2:D393    The amount of interest in a supplementary statistics text
Calculus        E2:E393    The extent to which calculus is required for the course
Uses_Software   F2:F393    Whether the course uses software
Enroll_A        G2:G393    Categorical variable indicating semester course
                           enrollment in the instructor’s course (for example,
                           001-050 means that from 1 to 50 students are enrolled
                           each semester)
Enroll_B        H2:H393    Categorical variable indicating annual course enrollment
Max_Cost        I2:I393    Maximum cost for a supplementary computer text

To open the Survey workbook:

1 Start Excel if necessary and maximize the Excel window.

2 Open the Survey workbook from the Chapter07 folder.

3 Save the workbook as Survey Table Statistics. The workbook appears
as shown in Figure 7-1.


Figure 7-1 
Survey data 



You’ve been asked to determine the relationship between the department
and whether calculus is required as a prerequisite for courses in that
department. First you’ll examine the distribution of professors in different
department categories.

In Excel, you obtain category counts by generating a PivotTable, a worksheet
table that summarizes data from the source data list (in this case
the survey data). Excel PivotTables are interactive; that is, you can update
them whenever you change the source data list, and you can switch row
and column headings to view the data in different ways (hence the term
pivot).

Try creating a PivotTable that summarizes the number of professors in
each department. You can insert a PivotTable using the commands on the
Insert tab.

To insert a PivotTable:


Figure 7-2
Create PivotTable
dialog box


1 Click the Insert tab from the Excel ribbon and then click the PivotTable
button from the Tables group.

Excel displays the Create PivotTable dialog box shown in Figure 7-2
with the data range A1:I393 already selected for you.



2 Verify that the New Worksheet option button is selected and then
click the OK button.

Excel opens a new worksheet containing the PivotTable tools shown
in Figure 7-3. Note that Excel has automatically added a new PivotTable
Tools ribbon used for creating and editing PivotTables.

Now you control the layout of the PivotTable. A PivotTable has four areas.
The Row Labels determine the categories that will appear in each row of
the table. Similarly, the Column Labels control the categories for each table
column. The Values area determines the values that will appear at each intersection
of row and column categories. Finally, the Report Filter area is
used to filter the PivotTable, showing only a subset of all of the data in the
selected data set.

You can design your PivotTable by dragging fields from the PivotTable
Field List box into these different areas. Try this now by creating a PivotTable
showing the breakdown of department categories in your data set.

To create a PivotTable of department categories: 

1 Click Dept from the PivotTable Field List box and drag it to the Row
Labels area box.

Excel adds the different department categories to the PivotTable. 

Now show the counts within each category. 

2 Drag Dept from the PivotTable Field List box and drop it in the
Values box. 

As shown in Figure 7-4, the PivotTable now displays the counts 
within each department. 



The department PivotTable shows the count of different departments 
represented in the survey. From the table you can quickly see that there are 
392 responses in the survey coming from 91 professors in business/economics, 
25 professors in the health sciences, 128 professors in math and science, and 
101 professors in the social sciences. There are also 47 respondents who did 
not specify a department. 
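
The counting that the PivotTable performs can be sketched outside Excel. The following is a minimal illustration, not part of the Survey workbook: the list dept_column is a hypothetical stand-in for the Dept range, built from the counts quoted above, and the label SocSci is assumed for the social sciences category.

```python
from collections import Counter

# Hypothetical stand-in for the Dept column: one label per respondent,
# with "" marking a respondent who left the department blank.
# The counts are the ones quoted in the text; the labels are assumptions.
dept_column = (
    ["Bus,Econ"] * 91
    + ["HealthSc"] * 25
    + ["MathSci"] * 128
    + ["SocSci"] * 101
    + [""] * 47
)

# A one-way frequency table: how many times each category occurs.
counts = Counter(dept_column)

for dept, n in counts.items():
    label = dept if dept else "(blank)"
    print(f"{label:10s} {n:4d}")
print(f"{'Total':10s} {sum(counts.values()):4d}")
```

Each distinct label becomes one row of the table, which is what dragging Dept to both the Row Labels and Values areas accomplishes.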

You can edit the PivotTable to remove the respondents that did not spec¬ 
ify a department. 


Removing Categories from a PivotTable 

PivotTables include drop-down list boxes that you can use to specify which 
categories are displayed in the table. Use a list box to remove respondents 
that did not specify a department. 

To remove a category from the PivotTable:

1 Click the Row Labels drop-down box on the PivotTable.

2 Deselect the blank checkbox in the list of categories. See Figure 7-5.


Figure 7-5
Removing the
blank category
from the
PivotTable



3 Click the OK button.

The PivotTable changes and no longer displays results from respondents
that did not specify a department. See Figure 7-6.


Figure 7-6 
The PivotTable 
with the blank 
category removed 

The PivotTable now shows a grand total of 345 respondents with the
blank department category removed from the table. 


Changing the Values Displayed by the PivotTable 

By default, the PivotTable displays the count for each cell in the table. You
can choose a variety of other types of values to display, including sums,
maximums, minimums, averages, and percentages. When you choose to display
percentages, you can show the percentage of all of the cells in the table
or the percentage within each row or column. Try this now by changing the
table to show the percentage for each category.

To display percentages of the values within the columns:

1 Right-click any of the count values in the PivotTable, and then click
Value Field Settings from the pop-up menu.

2 Click the Show values as tab in the Value Field Settings dialog box. 

3 Click % of column from the Show values as list box and then click 
the OK button. 

The PivotTable is modified to show values as percentages of the 
total rather than as counts. See Figure 7-7. 


Figure 7-7 
PivotTable 
showing 
percentages 

You can see at a glance that 26.38% of professors who listed their department
came from business/economics, 7.25% came from the health sciences,
37.10% came from math and science, and 29.28% came from the social 
sciences. 
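
The percentages come from dividing each count by the column total. As a rough check, assuming the department counts quoted earlier with the blank category removed (the dictionary below is an illustration, not a StatPlus or Excel feature):

```python
# Counts of respondents who named a department, as quoted in the text.
# The label "SocSci" is an assumed name for the social sciences category.
dept_counts = {"Bus,Econ": 91, "HealthSc": 25, "MathSci": 128, "SocSci": 101}

# Column total once the blank category is excluded.
total = sum(dept_counts.values())

# Each count expressed as a percentage of the column total.
percent_of_column = {dept: 100 * n / total for dept, n in dept_counts.items()}

for dept, pct in percent_of_column.items():
    print(f"{dept:10s} {pct:6.2f}%")
```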

To view the counts of the responses again:

1 Right-click any of the percentages in the PivotTable and then click
Value Field Settings from the pop-up menu.

2 Click the Show values as tab, select Normal from the Show values
as list box, and click the OK button.


Displaying Categorical Data in a Bar Chart 

You can quickly display your PivotTable data in a PivotChart. The default 
PivotChart is a bar chart, in which the length of the bar is proportional to 
the number of counts in each cell. 

To create a bar chart for department category: 

1 With the PivotTable still selected, click the PivotChart button
located in the Tools group of the Options tab on the PivotTable Tools
ribbon.

2 Select the first column chart from the Insert Chart dialog box and 
click the OK button. 

As shown in Figure 7-8, Excel adds a column chart to the worksheet 
displaying the department counts from the PivotTable. 

Figure 7-8
PivotChart of 
department 
affiliation 



The PivotChart works the same as the PivotTable. For example, if you
click the Axis Fields button in the PivotChart Filter Pane, you can add or
remove categories from the chart. The PivotChart is also linked to the PivotTable,
so any changes you make to layout or formatting of the chart are automatically
reflected in the PivotTable.

Column charts are often used by statisticians when the need is to show the
relative sizes of the groups. For example, it’s quickly apparent from the chart
that the mathematics and science departments have the highest representation
of any category in the data set. What is not clear from the chart is the
size of each group compared to the whole. For example, does the MathSci
group make up more than half of the total respondents? It’s not so easy to
determine that kind of information from the column chart. To deal with that
problem, we can have Excel add the counts to the chart.

To add counts to the column chart: 

1 With the PivotChart still selected, click the Data Labels button from
the Labels group on the Layout tab of the PivotChart Tools ribbon.

2 Click Center from the list of Data Label options.

Excel updates the chart, which now shows the counts for each col¬ 
umn in the PivotChart. See Figure 7-9. 


Figure 7-9 
Column chart 
with counts 



From this display we can see that 128 respondents come from the MathSci 
group, which corresponds to the information we saw earlier in the PivotTable. 


Displaying Categorical Data in a Pie Chart 

Another way of comparing the size of individual groups to the whole is 
the pie chart. A pie chart displays a circle (like a pie), in which each pie 
slice represents a different category. Let’s see how the pie chart displays the 
department data. Rather than re-creating the chart from scratch, we’ll simply 
change the chart type of the current graph. 

To display a pie chart: 

1 With the PivotChart still selected, click the Change Chart Type but¬ 
ton located on the Type group of the Design tab on the PivotChart 
Tools ribbon. 

2 Select the first Pie Chart subtype listed in the Change Chart Type 
dialog box and click the OK button. 

Excel displays the pie chart in Figure 7-10. 

Figure 7-10
Pie chart 
with counts 



From the pie chart we can easily see a graphical comparison of the size
of each department group relative to the entire collection of departments
represented in the survey. The graph retained these count labels when we
converted to the pie chart, but we can change that option as well. We can
display the percentage of each slice, and we can add a label to the slice
identifying which category it represents.

To change the pie chart’s display options:

1 With the PivotChart still selected, click the Data Labels button again
and then click More Data Label Options from the menu.

Excel opens the Format Data Labels dialog box.

2 Click the Percentage check box and deselect the Value check box.

3 Click the Category Name check box.

4 Click the Close button.

The chart changes as shown in Figure 7-11, displaying the percentages
and labels for each pie slice.

Figure 7-11
Pie chart with 
percentages 
and labels 



The pie chart is an effective tool for displaying categorical data, though it 
tends to be used more often in business reports than in statistical analyses.

EXCEL TIPS

• To view the source data for any particular cell in a PivotTable,
double-click the cell. Excel will open a new worksheet containing
the observations from the original data source that make up
that cell.

• You can specify data sources other than the current workbook 
for your PivotTable data, such as databases and external files. 

• If the values in the data source change, the PivotTable does not
update until you refresh it. Click the Refresh button on the Options
tab of the PivotTable Tools ribbon to reflect those changes.

• You can create your own customized functions for the PivotTable. 
To do so, select any cell in the table and click the Formulas button 
located in the Tools group of the Options tab on the PivotTable 
Tools ribbon, and then click Calculated Field. 

Two-Way Tables


What about the relationship between two categorical variables? You might
want to see whether a calculus requirement for statistics varies by department.
Excel’s PivotTable feature can compile a table of calculus requirement
by department.


To create a PivotTable for calculus requirement by department: 

1 Return to the Survey worksheet. 

2 Click the PivotTable button from the Tables group on the Insert tab. 

3 Click the OK button to add a new worksheet containing the Pivot¬ 
Table tools. 

4 Drag Calculus from the PivotTable Field List box and drop it into 
the Column Labels box. 

5 Drag the Dept field to the Row Labels area. 

6 Drag the Dept field onto the Values area. 

Excel creates a PivotTable with counts of the Dept field broken down 
by the Calculus field. See Figure 7-12. 


Figure 7-12 
PivotTable of 
Dept versus 
Calculus 













When you created the first PivotTable, you hid the blank category levels 
for the table. Do this as well for the two-way table.

To hide the missing values: 

1 Click the Row Labels drop-down list arrow in the PivotTable and 
deselect the blank checkbox. Click OK. 

2 Click the Column Labels drop-down list arrow and deselect the 
blank checkbox. Click OK. 

Figure 7-13 shows the completed PivotTable. 


Figure 7-13 
Two-way table 
of department 
versus calculus 
requirement 

The table in Figure 7-13 shows frequencies of different combinations of
department and calculus requirement. For example, cell B5, the intersection of
Bus,Econ and Not req, shows that 74 professors in the business or economics
departments do not require calculus as a prerequisite for their statistics course.
There are a total of 333 responses. Note that missing combinations of Dept and
Calculus are displayed as blanks in the PivotTable. For example, none of the
statistics courses offered in the HealthSc category has calculus as a prerequisite.
How do these values compare when viewed as percentages within each
department? Let’s modify the PivotTable to find out.

To show row percentages:

1 Right-click any of the count values in the table; then click Value Field 
Settings in the pop-up menu. 

2 Click the Show values as tab and then select % of row from the 
Show values as list box. 

3 Click the OK button. The revised PivotTable is shown in Figure 7-14. 

Figure 7-14
Two-way table 
of percentages 

On the basis of the percentages you can quickly observe that for most
departments calculus is not a prerequisite for statistics, except for courses 
in the math or science departments in which more than one-third of the 
courses have such a requirement. 
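
The percentages within each department can be sketched as follows. This is an illustration under stated assumptions: only some of the cell counts are quoted in the text (74, 15, 25, 42, and 2), and the remaining cells are reconstructed here from the quoted totals (119 MathSci courses, 59 courses with a prerequisite, 333 responses), so treat the table itself as an assumption rather than a copy of Figure 7-13.

```python
# Observed counts per department as (Not req, Prereq).
# Partly reconstructed from totals quoted in the text; an assumption.
observed = {
    "Bus,Econ": (74, 15),
    "HealthSc": (25, 0),
    "MathSci": (77, 42),
    "SocSci": (98, 2),
}

def row_percentages(table):
    """Express each cell as a percentage of its own row total."""
    result = {}
    for dept, cells in table.items():
        row_total = sum(cells)
        result[dept] = tuple(100 * cell / row_total for cell in cells)
    return result

row_pct = row_percentages(observed)
for dept, (not_req, prereq) in row_pct.items():
    print(f"{dept:10s} Not req {not_req:6.2f}%   Prereq {prereq:6.2f}%")
```

Under these assumed counts, MathSci is the only department in which the Prereq share exceeds one-third.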

How do these percentages compare to the overall total number of respon¬ 
dents in the survey? Let’s find out. 

To calculate percentages of the total responses in the survey: 

1 Right-click any cell in the PivotTable and click Value Field Settings 
from the pop-up menu. 

2 Click the Show values as tab and select % of total from the Show 
values as list box. Click the OK button. Figure 7-15 shows the revised 
PivotTable. 


Figure 7-15 
Percentages 
of the total 

On the basis of this table, almost 18% of the professors in the survey teach
a class that has calculus as a prerequisite whereas more than 82% do not. 

To reformat the table to show counts again: 

1 Right-click any of the percentages in the PivotTable; then click Value 
Field Settings in the pop-up menu. 

2 Click the Show values as tab and select Normal from the Show values 
as list box. Click the OK button. 


Computing Expected Counts 

If a calculus prerequisite were the same in each department, we would expect
to find the row percentages (shown in Figure 7-14) to be about the
same for each department. We would then say that department and calculus 
prerequisite are independent of each other, so that the pattern of usage does 
not depend on the department. It’s the same for all of them. On the other 
hand, if there is a difference between departments, we would say that de¬ 
partment and calculus prerequisite use are related. We cannot say anything 
about whether knowledge of calculus is usually required without knowing 
which department is being examined. 

You’ve seen that there might be a relationship between the calculus variable
and department. Is this difference significant? We could formulate the
following hypotheses:

H0: The calculus requirement is the same in all departments
Ha: The calculus requirement is related to the department

How can you test the null hypothesis? Essentially, you want a test statis¬ 
tic that will examine the calculus requirement across departments and then 
compare it to what we would expect to see if the calculus requirement and 
department type were independent variables. 

How do you compute the expected counts? Under the null hypothesis, the
percentage of courses requiring calculus should be the same across departments.
Our best estimate of these percentages comes from the percentage of
the grand total, shown in Figure 7-14. Thus we expect about 17.72% of the
courses to require calculus and about 82.28% not to. To express this value
in terms of counts, we multiply the expected percentage by the total number
of courses in each department. For example, there are 119 courses in the
MathSci departments, and if 17.72% of these had a calculus prerequisite,
this would be 119 × 0.1772, or about 21.08, courses. Note that the actual
observed value is 42 (cell C7 in Figure 7-13), so the number of courses that
require calculus is higher than expected under the null hypothesis. The expected
count can also be calculated using the formula

$$\text{Expected count} = \frac{(\text{Row total}) \times (\text{Column total})}{\text{Total observations}}$$

Thus the expected count of courses that have a calculus prerequisite in
the MathSci departments could also be calculated as follows:

$$\text{Expected count} = \frac{59 \times 119}{333} = 21.08$$

To create a table of expected counts, you can either use Excel to perform
the manual calculations or use the StatPlus add-in to create the table for you.
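
The manual calculation can be sketched in a few lines. This is a minimal illustration, not the StatPlus implementation: the function name expected_counts is hypothetical, and the observed table is partly reconstructed from the totals quoted in the text, so it is an assumption.

```python
def expected_counts(observed):
    """Expected counts under independence.

    Each expected cell is (row total) * (column total) / (total observations).
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

# Rows: Bus,Econ, HealthSc, MathSci, SocSci. Columns: Not req, Prereq.
# Partly reconstructed from totals quoted in the text; an assumption.
observed = [[74, 15], [25, 0], [77, 42], [98, 2]]

expected = expected_counts(observed)
for row in expected:
    print("  ".join(f"{value:6.2f}" for value in row))
```

The MathSci row reproduces the 21.08 expected courses with a prerequisite worked out above.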

To create a table of expected counts: 

1 Click Descriptive Statistics and then Table Statistics from the StatPlus
menu.

2 Enter the range A4:C8. 

3 Click the Output button; then click the New Worksheet option button 
and type the worksheet name, Calculus Department Table. Click OK. 

4 Click the OK button to start generating the table of expected counts. 
See Figure 7-16. 


Figure 7-16
Table of
observed and
expected
counts


The command generates some output in addition to the table of expected
counts shown in Figure 7-16, which we’ll discuss later. The values in the
Expected Counts table are the counts we would expect to see if a calculus
requirement were independent of department.


The Pearson Chi-Square Statistic 


With our tables of observed counts and expected counts, we need to calculate 
a single test statistic that will summarize the amount of difference between the 
two tables. In 1900, the statistician Karl Pearson devised such a test statistic, 
called the Pearson chi-square. The formula for the Pearson chi-square is 


$$\text{Pearson chi-square} = \sum_{\text{all cells}} \frac{(\text{Observed count} - \text{Expected count})^2}{\text{Expected count}}$$

If the frequencies all agreed with their expected values, this total would 
be 0. If there is a substantial difference between the observed and expected 
counts, this value will be large. For the data in Figure 7-16, this value is 


$$\text{Pearson chi-square} = \frac{(74 - 73.23)^2}{73.23} + \frac{(15 - 15.77)^2}{15.77} + \frac{(25 - 20.57)^2}{20.57} + \cdots + \frac{(2 - 17.72)^2}{17.72} = 47.592$$
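
The sum over all cells can be checked with a short sketch. As before, the observed table is partly reconstructed from totals quoted in the text and the function name pearson_chi_square is hypothetical, so this is an illustration under those assumptions rather than the StatPlus calculation.

```python
def pearson_chi_square(observed):
    """Sum of (observed - expected)^2 / expected over every cell of the table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Expected count under independence for this cell.
            exp = row_totals[i] * col_totals[j] / n
            statistic += (obs - exp) ** 2 / exp
    return statistic

# Rows: Bus,Econ, HealthSc, MathSci, SocSci. Columns: Not req, Prereq.
# Partly reconstructed from totals quoted in the text; an assumption.
observed = [[74, 15], [25, 0], [77, 42], [98, 2]]

chi_square = pearson_chi_square(observed)
print(f"Pearson chi-square = {chi_square:.3f}")
```

The largest contributions come from the Prereq cells for MathSci and SocSci, where the observed counts are farthest from their expected values.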


Is this value large or small? Pearson discovered that when the null
hypothesis is true, values of this test statistic approximately follow a distribution
called the χ² distribution (pronounced “chi-squared”). Therefore,
one needs to compare the observed value of the Pearson chi-square with the
χ² distribution to decide whether the value is large enough to warrant rejection
of the null hypothesis.



CONCEPT TUTORIALS 

The χ² Distribution 


To understand the χ² distribution better, use the Explore workbook for 
Distributions. 


To use the Distributions workbook: 

1 Open the Distributions workbook, located in the Explore folder. 
Enable the macros in the workbook. 

2 Click Chi-squared from the Table of Contents column. Review the 
material and scroll to the bottom of the worksheet. See Figure 7-17. 

Figure 7-17 
The χ² distribution 

Unlike the normal distribution and t distribution, the χ² distribution is 
limited to values ≥ 0. However, like the t distribution, the χ² distribution 
involves a single parameter: the degrees of freedom. When the degrees of 
freedom are low, the distribution is highly skewed. As the degrees of 
freedom increase, the weight of the distribution shifts farther to the right 
and the distribution becomes less skewed. To see how the shape of the 
distribution changes, try changing the degrees of freedom in the worksheet. 

To increase the degrees of freedom for the χ² distribution: 

1 Click the Degrees of freedom spin arrow and increase the degrees of 
freedom to 3. 

The distribution changes shape as shown in Figure 7-18. 

Like the normal and t distributions, the χ² distribution has a critical 
boundary for rejecting the null hypothesis, but unlike those distributions, 
it's a one-sided boundary. There are a few situations where one might use 
upper and lower critical boundaries. 

The critical boundary is shown in your chart with a vertical red line. 
Currently, the critical boundary is set for α = 0.05. In Figure 7-18, this is 
equal to 7.815. You can change the value of α in this worksheet to see the 
critical boundary for other p values. 

To change the critical boundary: 

1 Click the p value box, type 0.10, and press Enter. 

The critical boundary changes, moving back to 6.251. 


Experiment with other values for the degrees of freedom and the critical 
boundary. 
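
The critical boundaries quoted in this tutorial can also be approximated without the workbook. The sketch below is not how Excel or the Explore workbook computes them; it assumes integer degrees of freedom, uses a closed-form expression for the upper-tail area of the χ² distribution, and finds the boundary by bisection. The function names are hypothetical.

```python
import math


def chi2_sf(x, df):
    """Upper-tail area P(X > x) of a chi-square distribution with integer df."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) times a finite sum of (x/2)^k / k!.
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd df: start from the df = 1 tail and add the half-integer terms.
    total = math.erfc(math.sqrt(half))
    term = math.sqrt(half) / math.gamma(1.5)
    for k in range(1, (df - 1) // 2 + 1):
        total += math.exp(-half) * term
        term *= half / (k + 0.5)
    return total


def chi2_critical(alpha, df):
    """Value whose upper-tail area equals alpha, found by bisection."""
    lo, hi = 0.0, 1000.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if chi2_sf(mid, df) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


for alpha in (0.05, 0.10):
    print(alpha, round(chi2_critical(alpha, 3), 3))
```

Raising α from 0.05 to 0.10 enlarges the rejection region, so the boundary moves toward smaller values, as the worksheet shows.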

When you’re finished with the worksheet: 

1 Close the Distributions workbook. Do not save any changes. 

2 Return to the Survey Table Statistics workbook, displaying the 
Calculus Department Table worksheet. 


The degrees of freedom for the Pearson chi-square are determined by 
the numbers of rows and columns in the table. If there are r rows and 
c columns, the number of degrees of freedom is (r − 1) × (c − 1). For our 
table of calculus requirement by department, there are 4 rows and 
2 columns, and the number of degrees of freedom for the Pearson chi-square 
statistic is (4 − 1) × (2 − 1), or 3. 

Where does the formula for degrees of freedom come from? The Pearson 
chi-square is based on the differences between the observed and expected 
counts. Note that the sum of these differences is 0 for each row and column 
in the table. For example, in the first column of the table, the expected and 
observed counts are as shown in Table 7-2: 

Table 7-2 Counts for Calculus Requirement by Department 

Observed    Expected    Difference 
74          73.23         0.77 
25          20.57         4.43 
77          97.92       -20.92 
98          82.28        15.72 
Sum                       0.00 


Because this sum is 0, the last difference can be calculated on the basis of 
the previous three, and there are only three cells that are free to vary in value. 
Applied to the whole table, this means that if we know 3 of the 8 differences, 
then we can calculate the values of the remaining 5 differences. Hence the 
number of degrees of freedom is 3. 
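
The zero-sum constraint behind this argument can be checked numerically. The sketch below assumes the same reconstructed table of observed counts used for the worked values in this section (an assumption, not data read from the workbook) and confirms that the differences cancel within every row and every column.

```python
# Assumed counts (rows: Bus,Econ, HealthSc, MathSci, SocSci; columns: Not req, Prereq).
observed = [[74, 15], [25, 0], [77, 42], [98, 2]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Difference between observed and expected count in each cell.
differences = [
    [o - r * c / n for o, c in zip(row, col_totals)]
    for row, r in zip(observed, row_totals)
]

row_sums = [sum(row) for row in differences]
col_sums = [sum(col) for col in zip(*differences)]

# With every row and column summing to zero, only (r - 1) * (c - 1) cells are free.
degrees_of_freedom = (len(observed) - 1) * (len(observed[0]) - 1)
print(row_sums, col_sums, degrees_of_freedom)
```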

Working with the χ² Distribution in Excel 

Now that we know the value of the test statistic and the degrees of freedom, 
we are ready to test the null hypothesis. Excel includes several functions to 
help you work with the χ² distribution. Table 7-3 shows some of these. 


Table 7-3 Excel Functions for the χ² Distribution 

Function                       Description 
CHIDIST(x, df)                 Returns the p value for the χ² distribution for a 
                               given value of x and degrees of freedom df. 
CHIINV(p, df)                  Returns the χ² value from the χ² distribution with 
                               degrees of freedom df and p value p. 
CHITEST(observed, expected)    Returns the p value of the Pearson chi-square test, 
                               where observed is a range containing the observed 
                               counts and expected is a range containing the 
                               expected counts. 
PEARSONCHISQ(observed)         Calculates the Pearson chi-square, where observed 
                               is the table containing the observed counts. 
                               StatPlus required. 
PEARSONP(observed)             Calculates the p value of the Pearson chi-square, 
                               where observed is the table containing the observed 
                               counts. StatPlus required. 


The output you generated earlier displays (among other things) the value 
for the Pearson chi-square statistic. The χ² value is 47.592 with a p value of 
less than 0.001. Because this probability is less than 0.05, you reject the null 
hypothesis that the calculus requirement does not differ on the basis of the 
department (not surprisingly). 
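
For readers who want to see where such a p value comes from, here is a minimal Python sketch that plays the role of CHIDIST for this one case. It relies on a closed-form expression for the upper-tail area that holds only for 3 degrees of freedom; the function name is hypothetical.

```python
import math


def chi2_p_value_df3(x):
    """Upper-tail area of the chi-square distribution with 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


rows, cols = 4, 2
df = (rows - 1) * (cols - 1)          # 3 for the department table

p_value = chi2_p_value_df3(47.592)
print(df, p_value, p_value < 0.001)
```

Evaluating the same function at the critical boundary 7.815 gives an area very close to 0.05, which ties this calculation back to the boundary shown in the Distributions workbook.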

Breaking Down the Chi-Square Statistic 


The value of the Pearson chi-square statistic is built up from every cell in 
the table. You can get an idea of which cells contributed the most to the 
total value by observing the table of standardized residuals. The value of 
the standardized residual is 

\[
\text{Standardized residual} = \frac{\text{Observed count} - \text{Expected count}}{\sqrt{\text{Expected count}}}
\]

Figure 7-19 displays the standardized residuals for the Calculus Requirement 
by Department table. Note that the highest standardized residual, 4.56, is found 
for the MathSci department under the Prereq column, leading us to believe that 
this count had the highest impact on rejecting the null hypothesis. 
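
The standardized residuals can be sketched in Python as follows. The observed counts are an assumption reconstructed from values quoted in this chapter, and the row and column labels are illustrative; the code simply applies the formula above to each cell and reports the largest residual.

```python
import math

# Assumed counts; columns are (Not req, Prereq).
departments = ["Bus,Econ", "HealthSc", "MathSci", "SocSci"]
columns = ["Not req", "Prereq"]
observed = [[74, 15], [25, 0], [77, 42], [98, 2]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Standardized residual = (observed - expected) / sqrt(expected).
residuals = {}
for i, dept in enumerate(departments):
    for j, col in enumerate(columns):
        expected = row_totals[i] * col_totals[j] / n
        residuals[(dept, col)] = (observed[i][j] - expected) / math.sqrt(expected)

largest_cell = max(residuals, key=residuals.get)
print(largest_cell, round(residuals[largest_cell], 2))
```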


Figure 7-19 

Other Table Statistics 

A common mistake is to use the value of χ² to measure the degree of 
association between the two categorical variables. However, the χ² value, 
along with the p value, measures only the significance of the association. 
This is because the value of χ² is partly dependent on sample size and the 
size of the table. For example, in a 3 × 3 table, a χ² value of 10 is 
significant with a p value of 0.04, but the same value in a 4 × 4 table is 
not significant with a p value of 0.35. 
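
The two p values in this example can be reproduced with a short Python sketch. It assumes integer degrees of freedom and uses a closed-form expression for the upper-tail area of the χ² distribution; the function name is hypothetical and this is not the routine Excel uses internally.

```python
import math


def chi2_sf(x, df):
    """Upper-tail area P(X > x) of a chi-square distribution with integer df."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    total = math.erfc(math.sqrt(half))
    term = math.sqrt(half) / math.gamma(1.5)
    for k in range(1, (df - 1) // 2 + 1):
        total += math.exp(-half) * term
        term *= half / (k + 0.5)
    return total


# Same statistic, different table sizes: df = (r - 1) * (c - 1).
p_3x3 = chi2_sf(10, (3 - 1) * (3 - 1))   # 4 degrees of freedom
p_4x4 = chi2_sf(10, (4 - 1) * (4 - 1))   # 9 degrees of freedom
print(round(p_3x3, 2), round(p_4x4, 2))
```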

A measure of association, on the other hand, gives a value to the 
association between the row and column variables that is not dependent on the 
sample size or the size of the table. Generally, the higher the measure of 
association, the stronger the association between the two categorical variables. 

Figure 7-20 shows other test statistics and measures of association 
created by the StatPlus Table Statistics command. 

Figure 7-20 
Other table statistics 

Table 7-4 summarizes these statistics and their uses. 


Table 7-4 StatPlus Table Statistics 

Statistic                        Description 
Pearson chi-square               Calculates the difference between the observed 
                                 and expected counts. Approximately follows a χ² 
                                 distribution with (r − 1) × (c − 1) degrees of 
                                 freedom, where r is the number of rows in the 
                                 table and c is the number of table columns. 
Continuity-adjusted chi-square   Similar to the Pearson chi-square, except that it 
                                 adjusts the χ² value for the continuity of the χ² 
                                 distribution. 
Likelihood ratio chi-square      Approximately follows a χ² distribution with 
                                 (r − 1) × (c − 1) degrees of freedom. 
Phi                              Measures the association between the row and 
                                 column variables, varying from −1 to 1. A value 
                                 near 0 indicates no association. Phi varies from 
                                 0 to 1 unless the table's dimension is 2 × 2. 
Contingency                      A measure of association ranging from 0 (no 
                                 association) to a maximum of 1 (high association). 
                                 The upper bound may be less than 1, depending on 
                                 the values of the row and column totals. 
Cramer's V                       A variation of the contingency measure, modifying 
                                 the statistic so that the upper bound is always 1. 
Goodman-Kruskal gamma            A measure of association used when the row and 
                                 column values are ordinal variables. Gamma ranges 
                                 from −1 to 1. A negative value indicates negative 
                                 association, a positive value indicates positive 
                                 association, and 0 indicates no association 
                                 between the variables. 
Kendall's tau-b                  Similar to gamma, except that tau-b includes a 
                                 correction for ties. Used only for ordinal 
                                 variables. 
Stuart's tau-c                   Similar to tau-b, except that it includes a 
                                 correction for table size. Used only for ordinal 
                                 variables. 
Somers' D                        A modification of the tau-b statistic. Somers' D 
                                 is used for ordinal variables in which one 
                                 variable is used to predict the value of the 
                                 other variable. Somers' D (R|C) is used when the 
                                 column variable is used to predict the value of 
                                 the row variable. Somers' D (C|R) is used when 
                                 the row variable is used to predict the value of 
                                 the column variable. 


Because the χ² distribution is a continuous distribution and counts 
represent discrete values, some statisticians are concerned that the Pearson 
chi-square statistic is not appropriate. They recommend using the 
continuity-adjusted chi-square statistic instead. We feel that the Pearson 
chi-square statistic is more accurate and can be used without adjustment. 

Among the other statistics in Table 7-4, the likelihood ratio chi-square 
statistic is usually close to the Pearson chi-square statistic. Many 
statisticians prefer using the likelihood ratio chi-square because it is used 
in log-linear modeling, a topic beyond the scope of this book. 

All three test statistics shown in Figure 7-20 are significant at 
the 5% level. The association between the Calculus Requirement and 
Department variables ranges from 0.354 to 0.378 for the three measures of 
association (Phi, Contingency, and Cramer's V). The final four measures 
of association (gamma, tau-b, tau-c, and Somers' D) are used for ordinal data 
and are not appropriate for nominal data. 
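
The three nominal measures can be approximated from the chi-square value alone. The formulas below are the usual textbook definitions, which may differ in detail from what StatPlus implements, and the sample size of 333 is an assumption inferred from the counts quoted in this chapter rather than a figure stated in the output.

```python
import math

chi_square = 47.592    # Pearson chi-square reported for the department table
n = 333                # assumed total number of responses in the table
rows, cols = 4, 2

phi = math.sqrt(chi_square / n)
contingency = math.sqrt(chi_square / (chi_square + n))
cramers_v = math.sqrt(chi_square / (n * min(rows - 1, cols - 1)))

print(round(phi, 3), round(contingency, 3), round(cramers_v, 3))
```

Under these assumptions the values fall in the 0.354 to 0.378 range described above. Phi and Cramer's V coincide here because the table has only two columns.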


Validity of the Chi-Square Test with Small 
Frequencies 

One problem you may encounter is that it might not be valid to use the 
Pearson chi-square test on a table with a large number of sparse cells. A 
sparse cell is defined as a cell in which the expected count is less than 5. 
The Pearson chi-square test requires large samples, and this means that cells 
with small counts can be a problem. You might get by with as many as 
one-fifth of the expected counts under 5, but if it's more than that, the 
p value returned by the Pearson chi-square might lead you erroneously to 
reject or accept the null hypothesis. The StatPlus add-in will automatically 
display a warning if this occurs in your sample data. This warning was not 
displayed with the data from the survey; however, there was one cell that had 
a low expected count of 4.43. 
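
The one-fifth rule of thumb is easy to check in code. The sketch below assumes the reconstructed table of observed counts used for the worked values in this section (not data read from the workbook) and flags every cell whose expected count is below 5; it is not the check StatPlus performs.

```python
# Assumed counts (rows: Bus,Econ, HealthSc, MathSci, SocSci; columns: Not req, Prereq).
observed = [[74, 15], [25, 0], [77, 42], [98, 2]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

expected = [[r * c / n for c in col_totals] for r in row_totals]

# A cell is sparse when its expected count is less than 5.
sparse_cells = [
    (i, j, round(e, 2))
    for i, row in enumerate(expected)
    for j, e in enumerate(row)
    if e < 5
]
sparse_fraction = len(sparse_cells) / (len(expected) * len(expected[0]))

print(sparse_cells, sparse_fraction, sparse_fraction <= 0.2)
```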

If you wanted to remove a sparse cell from your analysis, how would you 
go about it? You can either pool columns or rows together to increase the 
cell counts or remove rows or columns from the table altogether. 

For these data you'll combine the counts from the Bus,Econ, HealthSc, 
and SocSci departments into a single group and compare that group to the 
counts in the MathSci department. Rather than editing the data in the 
worksheet, you can combine the groups in the PivotTable. 

To compare the MathSci department to all other departments: 

1 Return to the Survey worksheet and create another PivotTable on a 
new worksheet with Department in the row area and Calculus in the 
column area. Display the count of Department in the Values area. 

2 Remove the blank Department entry from the row area of the 
PivotTable and remove the blank Calculus entry from the Column area of 
the table. 

3 With the Ctrl key held down, click cells A5, A6, and A8 in the 
PivotTable. This selects the three rows corresponding to the Bus,Econ, 
HealthSc, and SocSci departments respectively. 

4 Click the Group Selection button from the Group on the Options tab 
of the PivotTable Tools ribbon. Excel adds grouping variables to the 
PivotTable rows as shown in Figure 7-21. 


Figure 7-21 
Grouping 
rows in the 
PivotTable 

Next you want to give the two groups descriptive names and collapse the 
PivotTable around the two groups you created. 

To rename and collapse the PivotTable groups: 

1 Click cell A5 in the PivotTable, type Other Depts, and press Enter. 

2 Click the minus boxes in front of the group titles in cells A5 and A9 
to collapse the groups. Figure 7-22 shows the collapsed PivotTable. 


Figure 7-22 
PivotTable with 
grouped rows 

Notice that the expected cell counts are all much greater than 5. Grouping 
the rows in the PivotTable is a good way to remove problems with sparseness 
in the cell counts. Now that you've restructured the PivotTable, rerun your 
analysis, comparing the MathSci department to all of the other combined 
departments. 

To check the table statistics on the revised table: 

1 Click Descriptive Statistics from the StatPlus menu and then click 
Table Statistics. 

2 Enter the range A4:C6. 

3 Click the Output button and send the output to a new worksheet named 
Calculus Department Table 2. 

4 Click the OK button to start generating the new batch of table 
statistics. See Figure 7-23. 

Figure 7-23 
Table statistics 
for the grouped 
PivotTable 

There is little difference in our conclusion after grouping the table rows. 
Once again the Pearson chi-square test statistic is highly significant with a 
p value of less than 0.001. Thus we would reject the null hypothesis and 
accept the alternative hypothesis that the MathSci departments are more 
likely to require a calculus prerequisite. 
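
The grouped comparison can be sketched in Python as a 2 × 2 table. The pooled counts below are an assumption: they come from adding up the reconstructed department counts used for the worked values in this chapter, not from the PivotTable itself. With 1 degree of freedom, the upper-tail area of the χ² distribution reduces to a single call to the complementary error function.

```python
import math

# Assumed pooled counts; columns are (Not req, Prereq).
observed = [
    [197, 17],   # Other Depts (Bus,Econ + HealthSc + SocSci)
    [77, 42],    # MathSci
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)
expected = [[r * c / n for c in col_totals] for r in row_totals]

chi_square = sum(
    (o - e) ** 2 / e
    for o_row, e_row in zip(observed, expected)
    for o, e in zip(o_row, e_row)
)

# For 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi_square / 2))
smallest_expected = min(min(row) for row in expected)

print(round(chi_square, 2), p_value < 0.001, round(smallest_expected, 2))
```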


Tables with Ordinal Variables 


The two-way table you just produced was for two nominal variables. The 
Pearson chi-square test makes no assumption about the ordering of the 
categories in the table. Now let's look at categorical variables that have 
inherent order. For ordinal variables, there are more powerful tests than 
the Pearson chi-square, which often fails to give significance for ordered 
variables. 

As an example, consider the Calculus and Enroll B variables. Since the 
Calculus variable tells the extent to which calculus is required for a given 
statistics class (Not req or Prereq), it can be treated as either an ordinal or 
a nominal variable. When a variable takes on only two values, there really 
is no distinction between nominal and ordinal, because any two values can 
be regarded as ordered. The other variable, Enroll B, is a categorical 
variable that contains measures of the size of the annual course enrollment. In 
the survey, instructors were asked to check one of eight categories (0-50, 
51-100, 101-150, 151-200, 201-300, 301-400, 401-500, and 501-) for the 
number of students in the course. You might expect that classes requiring 
calculus would have smaller enrollments. 


Testing for a Relationship between Two Ordinal Variables 

We want to test whether there is a relationship between a class requiring 
calculus as a prerequisite and the size of the class. Our hypotheses are 

H0: The pattern of enrollment is the same regardless of a calculus prerequisite. 
Ha: The pattern of enrollment is related to a calculus prerequisite. 

To test the null hypothesis, first form a two-way table for the categorical 
variables Calculus and Enroll B. 

To form the table: 

1 Return to the Survey worksheet and create another PivotTable on a 
new worksheet. 

2 Place the Enroll B field in the Row Labels area of the PivotTable 
and Calculus in the Column Labels area. Also place Calculus in the 
Values area of the table. 

3 Remove blank entries from both the row and column labels in the 
PivotTable. Figure 7-24 shows the final PivotTable. 

Figure 7-24 
Table of 
Enrollment 
versus 
Calculus 
Prerequisite 

In the table you just created, Excel automatically arranges the Enroll B 
levels in a combination of numeric and alphabetic order (alphanumeric 
order), so be careful with category names. For example, if 051-100 were 
written as 51-100 instead, Excel would place it near the bottom of the table 
because 5 comes after 1, 2, 3, and 4. 
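
The effect of zero-padding on text ordering can be demonstrated in Python. The category labels below are hypothetical stand-ins for the Enroll B values, and Python compares strings character by character, which mimics but is not guaranteed to match Excel's own sorting rules in every detail.

```python
# Hypothetical labels without zero-padding, listed in their intended order.
unpadded = ["1-50", "51-100", "101-150", "151-200",
            "201-300", "301-400", "401-500", "501-"]

# The same categories with leading zeros on the two smallest ranges.
padded = ["001-050", "051-100", "101-150", "151-200",
          "201-300", "301-400", "401-500", "501-"]

# Text sorting compares the first character, then the second, and so on.
print(sorted(unpadded))   # "51-100" drifts to the end of the list
print(sorted(padded))     # padded labels keep the intended order
```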

Remember that most of the expected values should exceed 5, so there is 
cause for concern about the sparsity in the second column between 201 and 
500, but because only 4 of the 16 cells have expected values less than 5, the 
situation is not terrible. Nevertheless, let's combine the enrollment levels 
from 201 to 500 in the PivotTable. 

To combine levels, use the same procedure you did with the calculus 
requirement table: 

1 Highlight A9:A11, the enrollment row labels for the categories from 
201 through 500. 

2 Click the Group Selection button from the Group on the Options tab 
of the PivotTable Tools ribbon. 

3 Click cell A4, type Enrollment, and press Enter. 

4 Click cell A13, type 201-500, and press Enter. 

5 Click the minus boxes in front of all of the row categories to collapse 
them. See Figure 7-25. 

Figure 7-25 
Enrollment table 
with pooled 
table rows 

Now generate the table statistics for the new table. 

To generate table statistics: 


1 Click Descriptive Statistics from the StatPlus menu and then click 
Table Statistics. 

2 Select the range A4:C10. 

3 Click the Output button and send the table to a new worksheet 
named Enrollment Statistics. 

4 Click the OK button. 


The statistics for this table are as shown in Figure 7-26. 


Figure 7-26 
Statistics for the 
Enrollment table 

It's interesting to note that those statistics which do not assume that the 
data are ordinal (the Pearson chi-square, the continuity-adjusted χ², and the 
likelihood ratio χ²) all fail to reject the null hypothesis at the 0.05 level. On 
the other hand, the statistics that take advantage of the fact that we're using 
ordinal data (the Goodman-Kruskal gamma, Kendall's tau-b, Stuart's tau-c, 
and Somers' D) all reject the null hypothesis. This illustrates an important 
point: Always use the statistical test that best matches the characteristics of 
your data. Relying on the ordinal tests, we reject the null hypothesis, 
accepting the alternative hypothesis that the pattern of enrollment differs on 
the basis of whether calculus is a prerequisite. 
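
To show how an ordinal measure uses the ordering of the categories, here is a sketch of the Goodman-Kruskal gamma computed from concordant and discordant pairs. The table of counts is entirely hypothetical (it is not the survey's enrollment table), the function name is made up, and StatPlus may compute the statistic and its significance differently.

```python
def goodman_kruskal_gamma(table):
    """Gamma = (C - D) / (C + D) for a table with ordered rows and columns."""
    concordant = discordant = 0
    n_rows, n_cols = len(table), len(table[0])
    for i in range(n_rows):
        for j in range(n_cols):
            for k in range(i + 1, n_rows):      # only cells in lower rows
                for m in range(n_cols):
                    if m > j:                    # higher on both variables
                        concordant += table[i][j] * table[k][m]
                    elif m < j:                  # higher on one, lower on the other
                        discordant += table[i][j] * table[k][m]
    return (concordant - discordant) / (concordant + discordant)


# Hypothetical counts: rows are enrollment sizes from small to large,
# columns are (Not req, Prereq).
hypothetical = [
    [10, 20],
    [15, 10],
    [20, 5],
]
gamma = goodman_kruskal_gamma(hypothetical)
print(gamma)
```

A negative gamma in this made-up table means that larger classes go with fewer calculus prerequisites, the kind of ordered pattern that the Pearson chi-square ignores.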

To explore how that difference manifests itself, let's examine the table of 
expected values and standardized residuals in Figure 7-27. 


Figure 7-27 
Expected counts 
and standardized 
residuals for the 
enrollment table 

From the table, we see that the null hypothesis underpredicts the number 
of courses with class sizes in the 1-50 range that require knowledge of 
calculus. The null hypothesis predicts that 12.37 classes fit this 
classification, and there were 20 of them. As the size of the classes 
increases, the null hypothesis increasingly overpredicts the number of courses 
that require calculus. For example, if the null hypothesis were true, we would 
expect to see almost 10 courses with 201-500 students each require knowledge of 
calculus. The observed number from the survey was 6. From this we conclude 
that class size and a calculus prerequisite are not independent and that the 
courses that require knowledge of calculus are more likely to be smaller. 


Custom Sort Order 

With ordinal data you want the values to appear in the proper order when 
created by the PivotTable. If the order is alphabetic or the variable itself is 
numeric, this is not a problem. However, what if the variable being 
considered has a definite order, but this order is neither alphabetic nor numeric? 
Consider the Interest variable from the Survey workbook, which measures 
the degree of interest in a supplementary statistics text. The values of this 
variable have a definite order (least, low, some, high, most), but this order 
is not alphabetic or numeric. You could create a numeric variable based on 
the values of Interest, such as 1 = least, 2 = low, 3 = some, 4 = high, and 
5 = most. Another approach is to create a custom sort order, which lets you 
define a sort order for a variable. 

You can define any number of custom sort orders. Excel already has some 
built in for your use, such as months of the year (Jan, Feb, Mar, ... Dec), so if 
your data set has a variable with month values, you can sort the data list by 
months (you can also do this with PivotTables). Try creating a custom sort 
order for the values of the Interest variable. 

To create a custom sort order for the values of the Interest variable: 

1 Click the Office Button and then click Excel Options. 

2 Click Popular from the list of Excel options. 

3 Click the Edit Custom Lists button. 

4 Click the List entries list box. 

5 Type least and press Enter. 

6 Type low and press Enter. 

7 Type some and press Enter. 

8 Type high and press Enter. 

9 Type most and click Add. 

Your Custom Lists dialog box should look like Figure 7-28. 

Figure 7-28 
Custom sort order 

10 Click the OK button twice to return to the workbook. 


Now create a PivotTable of Interest values to see whether Excel 
automatically applies the sort order you just created. 

To create the PivotTable: 

1 Return to the Survey worksheet and insert a new PivotTable on a 
new worksheet. 

2 Drag the Interest field to the Row Labels and Values area of the 
PivotTable. Figure 7-29 shows the resulting PivotTable. 

Figure 7-29 
Custom sort order 

Excel automatically sorts the Interest categories in the PivotTable in the 
proper order (least, low, some, high, and most) rather than alphabetically. 
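
The idea behind a custom sort order can be sketched outside Excel as well. In the Python example below the list of responses is made up for illustration; the custom list is turned into a rank lookup, and sorting by rank reproduces the intended order where plain alphabetical sorting does not.

```python
# The custom list defines the intended order of the categories.
custom_order = ["least", "low", "some", "high", "most"]
rank = {label: position for position, label in enumerate(custom_order)}

# Hypothetical responses, standing in for the Interest column.
responses = ["some", "most", "least", "high", "low", "some", "high"]
categories = set(responses)

alphabetical = sorted(categories)
by_custom_list = sorted(categories, key=rank.get)

print(alphabetical)      # high, least, low, most, some
print(by_custom_list)    # least, low, some, high, most
```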

You've completed your work with categorical data with Excel. You can 
save your changes to the workbook now and close Excel if you want to take 
a break before starting on the exercises. 


Exercises 

1. Use Excel to calculate the following 
p values for the χ² distribution: 

a. χ² = 4 with 4 degrees of freedom 

b. χ² = 4 with 1 degree of freedom 

c. χ² = 10 with 6 degrees of freedom 

d. χ² = 10 with 3 degrees of freedom 

2. Use Excel to calculate the following 
critical values for the χ² distribution: 

a. α = 0.10, degrees of freedom = 4 

b. α = 0.05, degrees of freedom = 4 

c. α = 0.05, degrees of freedom = 9 

d. α = 0.01, degrees of freedom = 9 

3. True or false, and why? The Pearson 
chi-square test measures the degree of 
association between one categorical 
variable and another. 

4. You are suspicious that a die has been 
tampered with. You decide to test it. 
After several tosses, you record the 
results shown in Table 7-5: 


Table 7-5 Die-Tossing Experiment 

Number         1    2    3    4    5    6 
Occurrences   32   20   28   14   23   15 

Use Excel's CHITEST function to determine 
whether this is enough evidence to 
reject a hypothesis that the die is true. 

5. Why should you not use the Pearson 
chi-square test statistic with ordinal 
data? What statistics should you use 
instead? 

6. The Junior College workbook (discussed 
earlier in Chapter 3) contains information 
on hiring practices at a junior college. 
Analyze the data from PivotTables 
created for the workbook. 

a. Open the Junior College workbook 
from the Chapter07 data folder and save 
it as Junior College Table Statistics. 

b. Create a customized list of teaching 
ranks sorted in the following order: 
instructor, assistant professor, associate 
professor, full professor. 

c. Create a PivotTable with Rank Hired 
as the row label, Gender as the column 
label, and count of rank hired in 
the values area. 

d. Explore the question of whether there 
is a relationship between teaching 
rank and gender. What are your 
hypotheses? Generate the table statistics 
for this PivotTable. Which statistics 
are appropriate to use with the table? 
Is there any difficulty with the data 
in the table? How would you correct 
these problems? 

e. Group the associate professor and full 
professor groups together and redo 
your analysis. 

f. Group the three professor ranks into 
one and redo your analysis, relating 
gender to the instructor/professor split. 

g. Write a report summarizing your 
results, displaying the relevant tables 
and statistics. How do your three 
tables differ with respect to your 
conclusions? Discuss some of the 
problems one could encounter when 
trying to eliminate sparse data. Which 
of the three tables best describes the 
data, in your opinion, and would you 
conclude that there is a relationship 
between teacher rank and gender? 
What pieces of information is this 
analysis missing? 

h. Create another PivotTable with Degree 
as the page field, Rank Hired as the 
row field, and Gender as the column 
field. (Because you are obtaining 
counts, you can use either Gender or 
Rank Hired as the data field.) 

i. Using the drop-down arrows on the 
Page field button, display the table 
of persons hired with a Master's 
degree. 

j. Generate table statistics for this group. 
Is the rank when hired independent 
of gender for candidates with Master's 
degrees? Redo the analysis if necessary 
to remove sparse cells. 

k. Save your changes to the workbook 
and write a report summarizing your 
conclusions, including any relevant 
tables and statistics. 

7. The Cold workbook contains data from 
a 1961 French study of 279 skiers during 
two 5-7 day periods. One group of 
skiers received a placebo (an ineffective 
saline solution), and another group 
received 1 gram of ascorbic acid per day. 
The study was designed to measure the 
incidence of the common cold in the 
two groups. 

a. State the null and alternative 
hypotheses of this study. 

b. Open the Cold workbook from the 
Chapter07 data folder and save it as 
Cold Statistics. 

c. Analyze the results of the study using 
the appropriate table statistics. 

d. Save your changes to the workbook 
and write a report summarizing your 
results. 

8. The data in the Marriage workbook 
contain information on the heights of 
newly married couples. The study is 
designed to test whether people tend 
to choose marriage partners similar in 
height to themselves. 

a. State the null and alternative 
hypotheses of this study. 

b. Open the Marriage workbook from 
the Chapter07 folder and save it as 
Marriage Analysis. 

c. Analyze the data in the marriage 
table. What test statistics are appropriate 
to use with these data? Do you 
accept or reject the null hypothesis? 

d. Save your changes to the workbook, 
write a report summarizing your 
results, and print the relevant statistics 
supporting your conclusion. 

9. The Gender Poll workbook contains 
polling data on how males and females 
respond to various social and political 
issues. Each worksheet in the workbook 
contains a table showing the responses 
to a particular question. 

a. Open the Gender Poll workbook from 
the Chapter07 folder and save it as 
Gender Poll Statistics. 

b. On each worksheet, calculate the table 
statistics for the opinion poll table. 

c. For which questions is the gender of 
the respondent associated with the 
outcome? 

d. Save your changes to the workbook 
and write a report summarizing your 
results, speculating on the types of 
issues that men and women might 
agree or disagree on. 

10. The Race Poll workbook contains 
additional polling data on how different races 
respond to social and political questions. 

a. Open the Race Poll workbook from 
the Chapter07 folder and save it as 
Race Poll Statistics. 

b. On each worksheet, calculate the 
table statistics for the opinion poll 
table. Resolve any problem with 
sparse data by combining the Black 
and Other categories. 

c. For which questions is the race of 
the respondent associated with the 
outcome? 

d. Save your changes to the workbook 
and write a report summarizing your 
results, speculating on the types 
of questions that blacks and members 
of other races might agree or 
disagree on. 

11. The Home Sales workbook contains 
historic data on home prices in 
Albuquerque (see Chapter 4 for a discussion 
of this workbook). 

a. Open the Home Sales workbook from 
the Chapter07 folder and save it as 
Home Sales Statistics. 

b. Analyze the data to determine 
whether there is evidence that houses 
in the NE sector are more likely to 
have offers pending than houses 
outside the NE sector. 

c. Save your changes to the workbook 
and write a report summarizing your 
conclusions. 

12. The Montana workbook contains 
polling data from a survey in 1992 about 
the financial conditions in Montana. 
You've been asked to analyze what 
factors influence political affiliation in 
the state. 

a. Open the Montana workbook from 
the Chapter07 folder and save it as 
Montana Political Statistics. 

b. Create a PivotTable of Political Party 
versus Region (of the state). Remove 
any blank categories from the table. 
Is there evidence to suggest that 
different regions have different political 
leanings? 

c. Create a PivotTable of Political Party 
versus Gender, removing any blank 
categories from the table. Is there 
evidence to suggest that political 
affiliation depends on gender? 

d. Create a PivotTable of Political Party 
versus Age, removing any blank 
categories from the table. Age is an ordinal 
variable. Using the appropriate test 
statistic, analyze whether there is any 
significant relation between age and 
political affiliation. 

e. Save your changes to the workbook 
and write a report summarizing your 
conclusions. 

13. The Montana workbook discussed in 
Exercise 12 also contains information on 
the public's assessment of the financial 
status of the state. Analyze the results 
from this survey. 

a. Open the Montana workbook from 
the Chapter07 folder and save it as 
Montana Financial Statistics. 

b. Create a PivotTable comparing the 
Financial Status variable to the 
Gender variable. Is there statistical 
evidence that gender plays a role in 
how individuals view the economy? 

c. Create a PivotTable of Financial Status 
and Income. Note that both are ordinal 
variables. Is there statistical evidence 
of a relation between the two? 

d. Create a PivotTable of Financial 
Status and Age. Once again note that 
both variables are ordinal. Using the 
appropriate statistical test, determine 
whether there is evidence of a relation 
between age and the assessment of the 
state's financial status. 

e. Create a PivotTable comparing 
Financial Status and Political Party. Is there 
statistical evidence of a relation 
between the two? 

f. Save your changes to the workbook 
and write a report summarizing your 
observations. 
Chapter 8


Regression and Correlation 


Objectives 

In this chapter you will learn to: 

• Fit a regression line and interpret the coefficients

• Understand regression statistics

• Use residuals to check the validity of the assumptions needed for
statistical inference

• Calculate and interpret correlations and their statistical significance

• Understand the relationship between correlation and simple
regression

• Create a correlation matrix and apply hypothesis tests to the
correlation values

• Create and interpret a scatter plot matrix

This chapter examines the relationship between two variables using
linear regression and correlation. Linear regression estimates a linear
equation that describes the relationship, whereas correlation measures
the strength of that linear relationship.


Simple Linear Regression 

When you plot two variables against each other in a scatter plot, the values 
usually don’t fall exactly in a perfectly straight line. When you perform a 
linear regression analysis, you attempt to find the line that best estimates 
the relationship between two variables (the y, or dependent, variable, and
the x, or independent, variable). The line you find is called the fitted
regression line, and the equation that specifies the line is called the
regression equation.


The Regression Equation 

If the data in a scatter plot fall approximately in a straight line, you can use 
linear regression to find an equation for the regression line drawn over the 
data. Usually, you will not be able to fit the data perfectly, so some points 
will lie above and some below the fitted regression line. 

The regression line that Excel fits will have an equation of the form
y = a + bx. Here y is the dependent variable, the one you are trying to
predict, and x is the independent, or predictor, variable, the one that is doing
the predicting. Finally, a and b are called coefficients. Figure 8-1 shows a
line with a = 10 and b = 2. The short vertical line segments represent the
errors, also called residuals, which are the gaps between the line and the
points. The residuals are the differences between the observed dependent
values and the predicted values. Because a is where the line intercepts
the vertical axis, a is sometimes called the intercept or constant term in
the model. Because b tells how steep the line is, b is called the slope. It
gives the ratio between the vertical change and the horizontal change along
the line. Here y increases from 10 to 30 when x increases from 0 to 10,
so the slope is

$$ b = \frac{\text{Vertical change}}{\text{Horizontal change}} = \frac{30 - 10}{10 - 0} = 2 $$
[Figure 8-1; callout: negative residual]


Suppose that x is years on the job and y is salary. Then the y intercept
(x = 0) is the salary for a person with zero years’ experience, the starting
salary. The slope is the change in salary per year of service. A person with
a salary above the line would have a positive residual, and a person with a
salary below the line would have a negative residual.

If the line trends downward so that y decreases when x increases, then
the slope is negative. For example, if x is age and y is price for used cars,
then the slope gives the drop in price per year of age. In this example, the
intercept is the price when new, and the residuals represent the difference
between the actual price and the predicted price. All other things being
equal, if the straight line is the correct model, a positive residual means a
car costs more than it should, and a negative residual means a car costs less
than it should (that is, it’s a bargain).


Fitting the Regression Line 

When fitting a line to data, you assume that the data follow the linear model:

$$ y = \alpha + \beta x + \varepsilon $$

where α is the “true” intercept, β is the “true” slope, and ε is an error term.
When you fit the line, you’ll try to estimate α and β, but you can never know
them exactly. The estimates of α and β, we’ll label a and b. The predicted
values of y using these estimates, we’ll label ŷ, so that

$$ \hat{y} = a + bx $$

To get estimates for α and β, we use values of a and b that result in a
minimum value for the sum of squared residuals. In other words, if y_i is an
observed value of y, we want values of a and b such that

$$ \text{Sum of squared residuals} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 $$

is as small as possible. This procedure is called the least-squares method.
The values a and b that result in the smallest possible sum for the squared
residuals can be calculated from the following formulas:

$$ b = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} $$

$$ a = \bar{y} - b\bar{x} $$

These are called the least-squares estimates. For example, say our data 
set contains the values listed in Table 8-1: 


Table 8-1 Data for Least-Squares Estimates

x    1    2    1    3    2
y    3    4    3    4    5

The sample averages for x and y are 1.8 and 3.4, and the estimates for a
and b are

$$ b = \frac{(1 - 1.8)(3 - 3.4) + (2 - 1.8)(4 - 3.4) + \cdots + (2 - 1.8)(5 - 3.4)}{(1 - 1.8)^2 + (2 - 1.8)^2 + \cdots + (2 - 1.8)^2} = 0.5 $$

$$ a = \bar{y} - b\bar{x} = 3.4 - 0.5 \times 1.8 = 2.5 $$

Thus the least-squares estimate of the regression equation is ŷ = 2.5 + 0.5x.
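
The least-squares formulas translate directly into a few lines of code. The following Python sketch is only an illustration: the function name `least_squares` and the six data points (scattered around the line y = 10 + 2x, the line drawn in Figure 8-1) are assumptions made for this example and do not come from any of the book's workbooks.

```python
def least_squares(xs, ys):
    """Return (a, b), the least-squares intercept and slope for y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator and denominator of the slope formula
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b


# Hypothetical data scattered around the line y = 10 + 2x
xs = [0, 2, 4, 6, 8, 10]
ys = [11, 13, 18, 22, 25, 31]

a, b = least_squares(xs, ys)

# Residuals are observed values minus predicted values
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
sum_sq = sum(e ** 2 for e in residuals)

print("intercept:", a, "slope:", b, "sum of squared residuals:", sum_sq)
```

Any other choice of intercept and slope for these points would give a larger sum of squared residuals, which is the property that the Explore workbook described later lets you test by hand.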


Regression Functions in Excel 

Excel contains several functions to help you calculate the least-squares 
estimates. Two of these are shown in Table 8-2. 
Table 8-2 Calculating Least-Squares Estimates

Function           Description
INTERCEPT(y, x)    Calculates the least-squares estimate, a, for known values y and x.
SLOPE(y, x)        Calculates the least-squares estimate, b, for known values y and x.

For example, if the y values are in the cell range A2:A11, and the x values
are in the range B2:B11, then the function INTERCEPT(A2:A11, B2:B11)
will display the value of a, and the function SLOPE(A2:A11, B2:B11) will
display the value of b.

EXCEL TIPS

• You can also calculate linear regression values using the LINEST
and LOGEST functions, but this is a more advanced topic. Both
of these functions use arrays. You can learn more about these
functions and about array functions in general by using Excel’s
online Help.


Exploring Regression

If you wish to explore the concepts behind linear regression, an Explore
workbook has been provided for you.

To explore the regression concept:

1 Open the Regression workbook in the Explore folder.

2 Review the contents of the workbook. The Explore Regression worksheet
shown in Figure 8-2 allows you to test the impact of different
intercept and slope estimates on the sum of the squared differences.
Figure 8-2
Explore
Regression
worksheet


3 After you are finished working with the file, close it.


Performing a Regression Analysis 

The Breast Cancer workbook contains data from a 1965 study analyzing 
the relationship between mean annual temperature and the mortality rate 
for women with a certain type of breast cancer. The subjects came from 16 
different regions in Great Britain, Norway, and Sweden. Table 8-3 presents 
the data. 
Table 8-3 Data for the Breast Cancer Workbook

Range Name     Range     Description
Region         A2:A17    A number indicating the region where the data have been collected
Temperature    B2:B17    The mean annual temperature of the region
Mortality      C2:C17    Mortality index for neoplasms of the female breast for the region


You’ve been asked to determine whether there is evidence of a linear
relationship between the mean annual temperature in the region and the
mortality index. Is the mortality index different for women who live in
regions with different temperatures?

To open the Breast Cancer workbook:

1 Open the Breast Cancer workbook from the Chapter08 folder.

2 Save the workbook as Breast Cancer Regression. The workbook
appears as shown in Figure 8-3.

[Figure 8-3; callout: breast cancer mortality index]
Plotting Regression Data


Before you calculate any regression statistics, you should always plot your
data. A scatter plot can quickly point out obvious problems in assuming
that a linear model fits your data (perhaps the scatter plot will show that the
data values do not fall along a straight line). Scatter plots in Excel also allow
you to superimpose the regression line on the plot along with the regression
equation. From this information, you can get a pretty good idea whether a
straight line fits your data or not.

To create a scatter plot of the mortality data:

1 Click Single Variable Charts from the StatPlus menu and then click
Fast Scatter plot.

2 Click the x axis button, select the Temperature range name and click OK.
Click the y axis button, select the Mortality range name and click OK.

3 Click the Chart Options button, enter Mortality vs. Temp for the
Chart title, Temperature for the x axis title, and Mortality for the y
axis title. Click OK.

4 Click the Output button and save the chart to a new chart sheet
named Mortality Scatter plot. Click OK.

5 Click the OK button to generate the scatter plot.

6 In the scatter plot that’s created, resize the horizontal scale so that the
lower boundary is 30, and then resize the vertical scale so that the lower
boundary is 50. Figure 8-4 shows the final version of the scatter plot.


Figure 8-4 
Scatter plot 
of the 
mortality 
index versus 
mean annual 
temperature 


Now you’ll add a regression line to the data.

To add a regression line:

1 Right-click any of the data points in the graph and click Add Trendline
from the menu. See Figure 8-5.


Figure 8-5
The Add
Trendline
command

2 Excel displays a list of possible regression and trend lines. Verify
that the Linear regression option is selected as shown in Figure 8-6.


Figure 8-6
Choosing a
trend line
(callouts: list of possible regression and trend lines; display the
regression equation on the chart; display the regression's R² value)

3 Click the Display Equation on chart and Display R-squared value
on chart checkboxes and then click the Close button.

Excel adds a regression line to the plot along with the regression
equation and R² value.

4 Drag the text containing the regression equation and R² value to a
point above the plot. See Figure 8-7.


Figure 8-7
Fitted
regression
line
(callouts: regression line; regression equation and R² value)


The regression equation for the mortality data is y = -21.795 + 2.3577x.
This means that for every degree that the annual mean temperature increased
in these regions, the breast cancer mortality index increased by about
2.3577 points.

How would you interpret the constant term in this equation (-21.795)?
At first glance, this is the y intercept, and it means that if the mean annual
temperature is 0, the value of the mortality index would be -21.795. Clearly
this is absurd; the mortality index can’t drop below zero. In fact, any mean
annual temperature of less than 9.24 degrees Fahrenheit will result in a negative
estimate of the mortality index. This does not mean that the linear equation
is useless, but it means you should be cautious in making any predictions for
temperature values that lie outside the range of the observed data.
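
The arithmetic behind the 9.24-degree threshold can be checked with a short Python sketch. The two coefficients are the ones reported on the chart; the function name `predicted_mortality` and the test temperature of 40 degrees are hypothetical choices made only for illustration.

```python
# Coefficients of the fitted regression equation reported on the chart
INTERCEPT = -21.795
SLOPE = 2.3577


def predicted_mortality(temperature):
    """Predicted mortality index for a given mean annual temperature."""
    return INTERCEPT + SLOPE * temperature


# Temperature at which the fitted line crosses zero: solve a + b*x = 0
zero_crossing = -INTERCEPT / SLOPE

print("line crosses zero at about", round(zero_crossing, 2), "degrees")
print("prediction at 40 degrees:", predicted_mortality(40.0))
print("prediction at 0 degrees:", predicted_mortality(0.0))
```

Below the zero crossing every prediction is negative, which is why the fitted line should not be trusted far outside the temperatures that were actually observed.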

The R² value is 0.7654. What does this mean? The R² value, also known as
the coefficient of determination, measures the percentage of variation in the
values of the dependent variable (in this case, the mortality index) that can be
explained by the change in the independent variable (temperature). R² values
vary from 0 to 1. A value of 0.7654 means that 76.54% of the variation in
the mortality index can be explained by the change in annual mean temperature.
The remaining 23.46% of the variation is presumed to be due to random
variability.

EXCEL TIPS

• You can use the Add Trendline command to add other types
of least-squares curves, including logarithmic, polynomial,
exponential, and power curves. For example, instead of fitting
a straight line, you can fit a second-degree curve to your data.


Calculating Regression Statistics 

The regression equation in the scatter plot is useful information, but it does 
not tell you whether the regression is statistically significant. At this point, 
you have two hypotheses to choose from. 

H0: There is no linear relationship between the mortality index and the
mean annual temperature.

Ha: There is a linear relationship between the mortality index and the
mean annual temperature.

The linear relationship we’re testing is expressed in terms of the regression 
equation. 

In order to analyze our regression, we need to use the Analysis ToolPak,
an add-in that comes with Excel and provides tools for analyzing regression.
If you do not have the Analysis ToolPak loaded on your system, you
should install it now. Refer to Chapter 1 for information on installing the
Data Analysis ToolPak.

To create a table of regression statistics: 

1 Return to the Mortality Data worksheet.

2 Click Data Analysis from the Analysis group on the Data tab to open
the Data Analysis dialog box.

3 Scroll down the list of data analysis tools and click Regression, and
then click the OK button.

4 Enter the cell range C1:C17 in the Input Y Range box.

5 Enter the cell range B1:B17 in the Input X Range box.

6 Because the first cell in these ranges contains a text label, click the
Labels checkbox.

7 Click the New Worksheet Ply option button and type Regression
Statistics in the accompanying text box.

8 Click all four of the Residuals checkboxes. 

Your Regression dialog box should appear as shown in Figure 8-8. 
Note that we did not select the Normal Probability Plots checkbox. 
This option creates a normal plot of the dependent variable. In most 
situations, this plot is not needed for regression analysis. 



9 Click OK. 

Excel generates the output shown in Figure 8-9. 
Figure 8-9
Output
from the
Regression
command
(callouts: regression statistics; analysis of variance table; parameter
estimates; regression and residual plots; residuals and predicted values)


The output is divided into six areas: regression statistics, analysis of
variance (ANOVA), parameter estimates, residual output, probability output
(not shown in Figure 8-9), and plots. Let’s examine these areas more closely.
The Regression command doesn’t format the output for us, so we may want
to do that ourselves on the worksheet.


Interpreting Regression Statistics 


Figure 8-10 
Regression 
statistics 
You’ve seen some of the regression statistics shown in Figure 8-10 before.
The R² value of 0.765 you’ve seen in the scatter plot. The multiple R value is
equal to the square root of the R² value. The multiple R is equal to the
absolute value of the correlation between the dependent variable and the
predictor variable. You’ll learn about correlation later in this chapter. The adjusted
R² is used when performing a regression with several predictor variables.
This statistic will be covered more in depth in Chapter 9. Finally, the
standard error measures the size of a typical deviation of an observed value
(x, y) from the regression line. Think of the standard error as a way of
averaging the size of the deviations from the regression line. The typical deviation
of an observed point from the regression line in this example is about 7.5447.
The observations value is the size of the sample used in the regression. In
this case, the regression is based on the values from 16 regions.


Interpreting the Analysis of Variance Table 


Figure 8-11 shows the ANOVA table output from the Analysis ToolPak 
Regression command. 


Figure 8-11 
Analysis of 
Variance 
(ANOVA) 
table 
The ANOVA table analyzes the variability of the mortality index. The
variability is divided into two parts: the first is the variability due to the
regression line, and the second is random variability.

The values in the df column of the table indicate the number of degrees of
freedom for each part. The total degrees of freedom are equal to the number of
observations minus 1. In this case the total degrees of freedom are 15. Of those
15 degrees of freedom, 1 degree of freedom is attributed to the regression, and
the remaining 14 degrees of freedom are attributed to random variability.

The SS column gives you the sums of squares. The total sum of squares
is the sum of the squared deviations of the mortality index from the overall
mean. This total is also divided into two parts. The first part, labeled in
the table as the regression sum of squares, is the sum of squared deviations
between the regression line and the overall mean. The second part, labeled
the residual sum of squares, is equal to the sum of the squared deviations of
the mortality index from the regression line. Recall that this is the value that
we want to make as small as possible in the regression equation. In this
example, the total sum of squares is 3,396.44, of which 2,599.53 is attributed
to the regression and 796.91 is attributed to error.

What percentage of the total sum of squares can be attributed to the
regression? In this case, it is 2,599.53/3,396.44 = 0.7654, or 76.54%. This is
equal to the R² value, which, as you learned earlier, measures the percentage
of variability explained by the regression. Note also that the total sum of
squares (3,396.44) divided by the total degrees of freedom (15) equals 226.43,
which is the variance of the mortality index. The square root of this value is
the standard deviation of the mortality index.

The MS (mean square) column displays the sum of squares divided by
the degrees of freedom. Note that the mean square for the residual is equal
to the square of the standard error in cell B7 (7.5447² = 56.9218). Thus you
can use the mean square for the residual to derive the standard error.

The next column displays the ratio of the mean square for the regression
to the mean square error of the residuals. This value is called the F ratio. A
large F ratio indicates that the regression may be statistically significant. In
this example, the ratio is 45.7. The p value is displayed in the next column
and equals 0.0000092. Because the p value is less than 0.05, the regression
is statistically significant. You’ll learn more about analysis of variance and
interpreting ANOVA tables in an upcoming chapter.
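
The relationships among the ANOVA entries are simple arithmetic, and recomputing them is a useful check on your reading of the table. The Python sketch below starts from the rounded sums of squares and degrees of freedom quoted above; the variable names are illustrative only, and because the inputs are rounded the results agree with Excel's output only to about three or four significant digits. The p value is not recomputed here because it requires the F distribution.

```python
import math

# Rounded values quoted from the ANOVA table
ss_regression = 2599.53
ss_residual = 796.91
df_regression = 1
df_residual = 14

# Totals are the sums of the two parts
ss_total = ss_regression + ss_residual
df_total = df_regression + df_residual

# R-squared: share of the total sum of squares due to the regression
r_squared = ss_regression / ss_total

# Mean squares: sum of squares divided by degrees of freedom
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual

# F ratio and the standard error derived from the residual mean square
f_ratio = ms_regression / ms_residual
standard_error = math.sqrt(ms_residual)

# Variance of the mortality index
variance_y = ss_total / df_total

print("R squared:", round(r_squared, 4))
print("F ratio:", round(f_ratio, 1))
print("standard error:", round(standard_error, 3))
print("variance of y:", round(variance_y, 2))
```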


Parameter Estimates and Statistics 


The final output table created by the Analysis ToolPak Regression command
displays the estimates of the regression parameters along with statistics
measuring their significance. See Figure 8-12.


Figure 8-12
Parameter
estimates
and statistics

As you’ve already seen, the constant coefficient, or intercept, equals about
-21.79, and the slope based on the temperature variable is about 2.36. The
standard errors for these values are shown in the Standard Error column
and are 15.672 and 0.349, respectively. The ratio of the parameter estimates
to their standard errors follows a t distribution with n - 2, or 14, degrees
of freedom. The ratios for each parameter are shown in the t Stat column, and
the corresponding two-sided p values are shown in the P value column. In
this example, the p value for the intercept term is 0.186, and the p value
for the slope term (labeled Temperature) is 9.2 × 10⁻⁶, or 0.0000092 (note
that this is the same p value that appeared in the ANOVA table).

The final part of this table displays the 95% confidence interval for each
of the terms. In this case, the 95% confidence interval for the intercept term
is about (-55.41, 11.82), and the 95% confidence interval for the slope is
(1.61, 3.11).

Note: In your output the confidence intervals might appear twice. The first 
pair, a 95% interval, always appears. The second pair always appears, but 
with the confidence level you specify in the Regression dialog box. In this 
case, you used the default 95% value, so that interval appears in both pairs. 
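
The t statistics and confidence intervals in this table can be reproduced from the estimates and their standard errors. The Python sketch below is an illustration under one stated assumption: the critical value 2.1448 (the 97.5th percentile of a t distribution with 14 degrees of freedom) is taken from a standard t table and does not appear in the Excel output. The function name `t_stat_and_interval` is hypothetical.

```python
# Assumption: two-sided 95% critical value for a t distribution with 14 df,
# looked up in a standard t table (not part of the Excel output)
T_CRITICAL_14_DF = 2.1448


def t_stat_and_interval(estimate, std_error, t_critical):
    """Return the t statistic and the confidence interval for one parameter."""
    t_stat = estimate / std_error
    margin = t_critical * std_error
    return t_stat, (estimate - margin, estimate + margin)


# Estimates and standard errors reported in the parameter table
slope_t, slope_ci = t_stat_and_interval(2.3577, 0.349, T_CRITICAL_14_DF)
intercept_t, intercept_ci = t_stat_and_interval(-21.795, 15.672, T_CRITICAL_14_DF)

print("slope: t =", round(slope_t, 2), "interval =", slope_ci)
print("intercept: t =", round(intercept_t, 2), "interval =", intercept_ci)
```

The interval for the intercept includes zero while the interval for the slope does not, which matches the p values in the table: only the slope is significantly different from zero at the 5% level.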

What have you learned from the regression statistics? First of all,
you would decide to reject the null hypothesis and accept the alternative
hypothesis that a linear relationship exists between the mortality index and
temperature. On the basis of the confidence interval for the slope parameter,
you can report with 95% confidence that for each degree increase in
the mean annual temperature, the mortality index for the region increases
by between 1.61 and 3.11 points.


Residuals and Predicted Values 


The last part of the output from the Analysis ToolPak’s Regression command
consists of the residuals and the predicted values. See Figure 8-13 (the values
have been reformatted to make them easier to view).


Figure 8-13 
Residuals 
and predicted 
values 



As you’ve learned, the residuals are the differences between the observed
values and the regression line (the predicted values). Also included in the
output are the standardized residuals. From the values shown in Figure 8-13,
you see that there is one residual that seems larger than the others; it is found
in the first observation and has a standardized residual value of 1.937.
Standardized residuals are residuals standardized to a common scale, regardless
of the original unit of measurement. A standardized residual whose value is
above 2 or below -2 is a potential outlier. There are many ways to calculate
standardized residuals. Excel calculates them using the following formula:

$$ \text{Standardized residual} = \frac{\text{Residual}}{\sqrt{\text{Sum of squared residuals}/(n - 1)}} $$

where n is the number of observations in the data set. In this data set, the
value of the first standardized residual is

$$ \frac{14.12}{\sqrt{796.9058/15}} = 1.937 $$

You’ll want to keep an eye on this observation as you continue to explore
this regression model. As you’ll see shortly, the residuals play an important
role in determining the appropriateness of the regression model.
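
The standardized residual formula is easy to verify numerically. The following Python sketch applies it to the first observation, using the residual, the sum of squared residuals, and the sample size quoted above; the function name `standardized_residual` is an illustrative choice, not an Excel or StatPlus function.

```python
import math


def standardized_residual(residual, sum_sq_residuals, n):
    """Standardize a residual using the formula described for Excel's output."""
    return residual / math.sqrt(sum_sq_residuals / (n - 1))


# First observation: residual 14.12, sum of squared residuals 796.9058, n = 16
value = standardized_residual(14.12, 796.9058, 16)

print(round(value, 3))  # 1.937
print("potential outlier:", abs(value) > 2)
```

Because 1.937 falls just inside the range from -2 to 2, the first observation is not flagged by this rule, although it is close enough to deserve attention.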
Checking the Regression Model

As in any statistical procedure, for statistical inference on a regression, you
are making some important assumptions. There are four:

1. The straight-line model is correct.

2. The error term ε is normally distributed with mean 0.

3. The errors have constant variance.

4. The errors are independent of each other.

Whenever you use regression to fit a line to data, you should consider these
assumptions. Fortunately, regression is somewhat robust, so the assumptions
do not need to be perfectly satisfied.

One point that cannot be emphasized too strongly is that a significant
regression is not proof that these assumptions haven’t been violated. To verify
that your data do not violate these assumptions, you go through a series of
tests, called diagnostics.


Testing the Straight-Line Assumption 

To test whether the straight-line model is correct, you should first create a
scatter plot of the data to inspect visually whether the data depart from this
assumption in any way. Figure 8-14 shows a classic problem that you may
see in your data.


Figure 8-14
A curved
relationship



Another sharper way of seeing whether the data follow a straight line is
to fit the regression line and then plot the residuals of the regression against
the values of the predictor variable. A U-shaped (or upside-down U-shaped)
pattern to the plot, as shown in Figure 8-15, is a good indication that the data
follow a curved relationship and that the straight-line assumption is wrong.

Figure 8-15
Residuals
showing a
curved
relationship



Let’s apply this diagnostic to the mortality index data. The Regression
command creates this plot for you, but it can be difficult to read because of its size.
We’ll move the plot to a chart sheet and reformat the axes for easier viewing.


To create a plot of the residuals versus the predictor variable: 

1 Scroll to cell J1 in the Regression Statistics worksheet and click the
Temperature Residual Plot.

2 Click the Move Chart button located in the Location group of the
Design tab on the Chart Tools ribbon.

3 Click the As new sheet option button and then type Residuals vs.
Temperature and click OK.

4 Rescale the horizontal axis of the temperature variable, so that the
lower boundary is 30. The revised plot appears in Figure 8-16.


Figure 8-16
Residuals
versus
temperature
values

The plot shows that most of the positive residuals tend to be located at the
lower and higher temperatures and most of the negative residuals are
concentrated in the middle temperatures. This may indicate a curve in the data.
The large first observation is influential here. Without it, there would be less
indication of a curve.


Testing for Normal Distribution of the Residuals 

The next diagnostic is a normal plot of the residuals. The Analysis ToolPak
does not provide this chart, so we’ll create one with StatPlus.

To create a Normal Probability Plot of the residuals:

1 Return to the Regression Statistics worksheet.

2 Click Single Variable Charts from the StatPlus menu and then click
Normal P-Plots.

3 Click the Data Values button.

4 In the Input Options dialog box, click the Use Range References
option button and then select the range C24:C40. Verify that the Range
Include a Row of Column Labels checkbox is selected and click OK.

5 Click the Output button, verify that the As a New Chart sheet option
button is selected, and type Residual Normal Plot in the accompanying
text box. Click the OK button.

6 Click the OK button to start generating the Normal Probability plot.
See Figure 8-17.

Recall that if the residuals follow a normal distribution, they should
fall evenly along the superimposed line on the normal probability plot.
Although the points in Figure 8-17 do not fall perfectly on the line, the
departure is not strong enough to invalidate our assumption of normality.


Testing for Constant Variance in the Residuals 

The next assumption you should always investigate is the assumption of 
constant variance in the residuals. A commonly used plot to help verify this 
assumption is the plot of the residuals versus the predicted values. This plot 
will also highlight any problems with the straight-line assumption. 


Figure 8-18 
If the constant variance assumption is violated, you may see a plot like
the one shown in Figure 8-18. In this example, the variance of the residuals
is larger for larger predicted values. It’s not uncommon for variability to
increase as the value of the response variable increases. If that happens, you
might remove this problem by using the log of the response variable and
performing the regression on the transformed values.

With one predictor variable in the regression equation, the scatter plot
of the residuals versus the predicted values is identical, apart from the scale
of the horizontal axis, to the scatter plot of the residuals versus the predictor
variable (shown earlier in Figure 8-16). The scatter plot indicates that there
may be a decrease in the variability of the residuals as the predicted values
increase. Once again, though, this interpretation is influenced by the presence
of the possible outlier in the first observation. Without this observation, there
might be no reason to doubt the assumption of constant variance.


Testing for the Independence of Residuals 

The final regression assumption is that the residuals are independent of
each other. This assumption is of concern only in situations where there
is a defined order for the observations. For example, if we do a regression
of a predictor variable versus time, the observations will follow a sequential
order. The assumption of independence can be violated if the value of
one observation influences the value of the next observation. For example,
a large value might be followed by a small value, or large and small values
could be clustered together (see Figure 8-19). In these cases, the residuals do
not show independence, because you can predict what the sign of the next
value will be on the basis of the current value.


Figure 8-19 
Residuals versus 
predicted 
values 




In examining residnals, we can examine the sign of the valnes (either 
positive or negative) and determine how many valnes with the same sign 
are clnstered together. These gronps of similarly signed valnes are called 
rnns. For example, consider a data set of 10 residnals containing 5 positive 
values and 5 negative valnes. The valnes conld follow an order with only 
two rnns, snch as 


-t-t-h-h-h- 

In this case, we wonld snspect that the residnals were not independent, 
becanse the positives and negatives are clnstered together in the seqnence. 
On the other hand, we might have the opposite problem, where there conld 
be as many as ten rnns, snch as 

-h--l-- 

Here, we snspect the residnals are not independent, becanse the re¬ 
sidnals are constantly switching sign. Finally, we might have something 
in-between, snch as 


-h-l---l-H--I- 

which has five rnns. If the nnmber of rnns is very large or very small, we 
wonld snspect that the residnals are not independent. How large (or how 
small) does this valne have to be? Using probability theory, statisticians 
have calcnlated the p valnes for a runs test, associated with the nnmber of 
rnns observed for different sample sizes. If we let n be the sample size, n + 
be the nnmber of positive valnes, and n_ be the nnmber of negative valnes, 
the expected nnmber of rnns p is 

2n + n_ 

u = —=— -H 1 
n 
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and the standard deviation a is 


2n+n_ /2n+n_ 
n(n - l)v n 

If r is the observed number of runs, then the value 

(r - /r + |) 
z = - 

(7 

approximately follows a standard normal distribution for large sample sizes
(where $n_+$ and $n_-$ are both > 10). For example, if n = 10, $n_+$ = 5, and
$n_-$ = 5, then μ = 6 and σ = √(20/9) = 1.49. If 5 runs have been observed,
z = -0.335 and the p value is 0.368. This is very close to the exact p value
of 0.357, so we would not find this an extremely unusual number of runs. On
the other hand, if we observe only 3 runs, then z = -1.677 and the p value
is 0.047 (the exact value is 0.04).
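
The calculation for 5 observed runs can be checked with a short script. What follows is a minimal Python sketch, not something Excel or StatPlus provides; the function names count_runs, runs_test_z, and exact_lower_p are hypothetical, the p value is taken as the lower-tail probability, and the exact probability assumes the standard combinatorial distribution of the number of runs.

```python
from itertools import groupby
from math import comb, erf, sqrt

def count_runs(signs):
    """Count runs: maximal blocks of identical consecutive signs."""
    return sum(1 for _ in groupby(signs))

def runs_test_z(r, n_pos, n_neg):
    """Normal approximation for r observed runs (lower-tail p value)."""
    n = n_pos + n_neg
    mu = 2 * n_pos * n_neg / n + 1
    sigma = sqrt(2 * n_pos * n_neg * (2 * n_pos * n_neg - n) / (n ** 2 * (n - 1)))
    z = (r - mu + 0.5) / sigma              # 0.5 is the continuity correction
    p = 0.5 * (1 + erf(z / sqrt(2)))        # standard normal CDF at z
    return mu, sigma, z, p

def exact_lower_p(r, n_pos, n_neg):
    """Exact probability of r or fewer runs (assumed combinatorial formula)."""
    total = comb(n_pos + n_neg, n_pos)
    ways = 0
    for k in range(2, r + 1):
        h = k // 2
        if k % 2 == 0:                      # even number of runs
            ways += 2 * comb(n_pos - 1, h - 1) * comb(n_neg - 1, h - 1)
        else:                               # odd number of runs
            ways += (comb(n_pos - 1, h) * comb(n_neg - 1, h - 1)
                     + comb(n_pos - 1, h - 1) * comb(n_neg - 1, h))
    return ways / total

signs = "++--++---+"                        # 5 positive, 5 negative residuals
r = count_runs(signs)
mu, sigma, z, p = runs_test_z(r, signs.count("+"), signs.count("-"))
print(r, mu, round(sigma, 2), round(z, 3), round(p, 3))
print(round(exact_lower_p(5, 5, 5), 3), round(exact_lower_p(3, 5, 5), 3))
```

The approximation is only rough at n = 10, which is why the text recommends it for larger samples.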

Another statistic used to test the assumption of independence is the 
Durbin-Watson test statistic. In this test, we calculate the value 

$$DW = \frac{\sum_{i=2}^{n} (e_i - e_{i-1})^2}{\sum_{i=1}^{n} e_i^2}$$

where $e_i$ is the ith residual in the data set. The value of DW is then
compared to a table of Durbin-Watson values to see whether there is evidence
of a lack of independence in the residuals. Generally, a value of DW
approximately equal to 0 or 4 suggests that the residuals are not
independent. A value of DW near 2 suggests independence. Values in between may
be inconclusive.
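
To see how the statistic behaves at the two extremes, here is a minimal Python sketch of the Durbin-Watson calculation. The function name durbin_watson and both residual lists are hypothetical illustrations, not data from the chapter.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences over the sum of squared residuals."""
    numerator = sum((residuals[i] - residuals[i - 1]) ** 2
                    for i in range(1, len(residuals)))
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator

clustered = [1.0, 1.0, 1.0, -1.0, -1.0, -1.0]     # long runs of one sign
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]   # sign flips at every step

print(durbin_watson(clustered))     # well below 2
print(durbin_watson(alternating))   # well above 2
```

Clustered residuals pull the value toward 0, and constantly alternating residuals push it toward 4.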

Because the mortality index data are not sequential, you shouldn’t apply 
the runs test or the Durbin-Watson test. Remember, these statistics are most 
useful when the residuals have a definite sequential order. 

After performing the diagnostics on the residuals, you conclude that there 
is no hard evidence to suggest that the regression assumptions have been 
violated. On the other hand, there is a problematic large residual in the first 
observation to consider. You should probably redo the analysis without the 
first observation to see what effect (if any) this has on your model. You’ll 
have a chance to do that in the exercises at the end of the chapter. 

STATPLUS TIPS

• Excel does not include a function to perform the runs test, but you
can use the Runs Test command from the Time Series submenu
on the StatPlus menu on your time-ordered residuals to perform
this analysis.

• Use the functions RUNS(range, center) and RUNSP(range,
center) to calculate the number of runs in a data set and the
corresponding p value for a set of data in the cell range, range,
around the central line center. StatPlus required.

• Use the function DW(range) to calculate the Durbin-Watson test
statistic for the values in the cell range range. StatPlus required.


Correlation 

The value of the slope in our regression equation is a product of the scale in
which we measure our data. If, for example, we had chosen to express the
temperature values in degrees Centigrade, we would naturally have a
different value for the slope (though, of course, the statistical significance of the
regression would not change). Sometimes, it’s an advantage to express the
strength of the relationship between one variable and another in a
dimensionless number, one that does not depend on scale. One such value is the
correlation. The correlation expresses the strength of the relationship on a
scale ranging from -1 to 1.

A positive correlation indicates a positive relationship, in which
an increase in the value of one variable tends to go with an increase in the
value of the second variable. This might occur in the relationship between
height and weight. A negative correlation indicates that an increase in the
first variable signals a decrease in the second variable. An increase in price
for an object could be negatively correlated with sales. See Figure 8-20. A
correlation of zero does not imply there is no relationship between the two
variables. One can construct a nonlinear relationship that produces a
correlation of zero.


Figure 8-20
Correlations
(panels: positive correlation, negative correlation)

The most often used measure of correlation is the Pearson correlation
coefficient, which is usually identified with the letter r. The formula for r is

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \, \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$$

For example, the correlation of the data in Table 8-1 is

$$r = \frac{(1 - 1.8)(3 - 3.4) + (2 - 1.8)(4 - 3.4) + \cdots + (2 - 1.8)(5 - 3.4)}{\sqrt{(1 - 1.8)^2 + \cdots + (2 - 1.8)^2} \times \sqrt{(3 - 3.4)^2 + \cdots + (5 - 3.4)^2}} = \frac{1.4}{\sqrt{2.8} \times \sqrt{1.2}} = 0.763$$

which indicates a high positive correlation. 
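
The Pearson formula is short enough to compute directly. The following is a minimal Python sketch, with a hypothetical function name pearson_r and made-up data (not Table 8-1), that also illustrates the earlier remark that a nonlinear relationship can produce a correlation of zero.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation: cross-products over the product of root sums of squares."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

x = [-2, -1, 0, 1, 2]                  # hypothetical x values
linear = [2 * v + 1 for v in x]        # points exactly on a rising line
curved = [v ** 2 for v in x]           # points exactly on a parabola

print(pearson_r(x, linear))            # perfect positive correlation
print(pearson_r(x, curved))            # zero, although y depends entirely on x
```

In the second case y is completely determined by x, yet r is zero because the relationship has no linear component.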


Correlation and Slope 


Notice that the numerator in the equation for r is exactly the same as the
numerator for the slope b in the regression equation shown earlier. This is
important because it means that the slope = 0 when r = 0 and that the sign
of the slope is the same as the sign of the correlation. The slope can be any
real number, but the correlation must always be between -1 and +1. A
correlation of +1 means that all of the data points fall perfectly on a line of
positive slope. In such a case, all of the residuals would be 0 and the line
would pass right through the points; it would have a perfect fit.

In terms of hypothesis testing, the following statements are equivalent:

H0: There is no linear relationship between the predictor variable and the
dependent variable.

H0: There is no population correlation between the two variables.

In other words, the correlation is zero if the slope is zero, and vice versa.
When you do a statistical test for correlation, the assumptions are the same
as the assumptions for linear regression.
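
The shared numerator can be made explicit. As a sketch, assume the least-squares slope has the usual form $b = S_{xy}/S_{xx}$, and write $S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y})$, $S_{xx} = \sum (x_i - \bar{x})^2$, and $S_{yy} = \sum (y_i - \bar{y})^2$ (this shorthand is introduced here and is not the book's notation). Then

$$r = \frac{S_{xy}}{\sqrt{S_{xx}} \, \sqrt{S_{yy}}}, \qquad b = \frac{S_{xy}}{S_{xx}} = r \sqrt{\frac{S_{yy}}{S_{xx}}}$$

Because the square-root factor is always positive, b and r have the same sign, and one is zero exactly when the other is.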


Correlation and Causality 


Correlation indicates the relationship between two variables without
assuming that a change in one causes a change in the other. For example, if
you learn of a correlation between the number of extracurricular activities
and grade-point average (GPA) for high school students, does this imply that
if you raise a student’s GPA, he or she will participate in more after-school
activities? Or that if you ask the student to get more involved in extracurricular
activities, his or her grades will improve as a result? Or is it more likely that
if this correlation is true, the type of people who are good students also tend
to be the type of people who join after-school groups? You should therefore
be careful never to confuse correlation with cause and effect, or causality.


Spearman’s Rank Correlation Coefficient s 

Pearson’s correlation coefficient is not without problems. It can be
susceptible to the influence of outliers in the data set, and it assumes that a
straight-line relationship exists between the two variables. In the presence of
outliers or a curved relationship, Pearson’s r may not detect a significant
correlation. In those cases, you may be better off using a nonparametric
measure of correlation, Spearman’s rank correlation, which is usually denoted
by the symbol s. As with the nonparametric tests in Chapter 7, you replace
observed values with their ranks and calculate the value of s on the ranks.
Spearman’s rank correlation, like many other nonparametric statistics, is less
susceptible to the influence of outliers and is better than Pearson’s correlation
for nonlinear relationships. The downside to the Spearman correlation is that
it is not as powerful as the Pearson correlation in detecting significant
correlations in situations where the parametric assumptions are satisfied.
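
The rank-then-correlate idea can be sketched in a few lines of Python. The function names ranks, pearson_r, and spearman_s are hypothetical, tied values are given their average rank (a common convention, assumed here), and the data are made up to show a curved relationship with one extreme value.

```python
from math import sqrt

def pearson_r(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def ranks(values):
    """Rank from 1 (smallest) upward; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1          # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def spearman_s(x, y):
    """Spearman's rank correlation: Pearson's r computed on the ranks."""
    return pearson_r(ranks(x), ranks(y))

x = [1, 2, 3, 4, 5]                        # hypothetical data
y = [1, 2, 4, 8, 100]                      # always increasing, far from linear

print(round(pearson_r(x, y), 3))           # pulled down by the extreme value
print(spearman_s(x, y))                    # the ranks agree perfectly
```

Because y always increases with x, the ranks line up exactly, so the rank correlation is at its maximum even though Pearson's r is not.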


Correlation Functions in Excel 

To calculate correlation values in Excel, you can use some of the functions 
shown in Table 8-4. Note that Excel does not include functions to calculate 
Spearman’s rank correlation or the p values for the two types of correlation 
measures. 


Table 8-4 Calculating Correlation Values

Function           Description
CORREL(x, y)       Calculates Pearson’s correlation r for the values in x and y.
CORRELP(x, y)      Calculates the two-sided p value of Pearson’s correlation for
                   the values in x and y. StatPlus required.
SPEARMAN(x, y)     Calculates Spearman’s rank correlation s for the values in x
                   and y. StatPlus required.
SPEARMANP(x, y)    Calculates the two-sided p value of Spearman’s rank
                   correlation for the values in x and y. StatPlus required.


Let’s use these functions to calculate the correlation between the mortality
index and the mean annual temperature for the breast cancer data.

To calculate the correlations and p values:

1 Return to the Mortality Data worksheet. 

2 Enter the labels Pearson’s r, p value, Spearman’s s, and p value in
the cell range A19:A22. Enlarge the width of column A to fit the size
of the new labels.

3 Click cell B19, type =CORREL(temperature, mortality), and press 
Enter. 

4 In cell B20, type =CORRELP(temperature, mortality) and press Enter. 

5 In cell B21, type =SPEARMAN(temperature, mortality) and press 
Enter. 

6 In cell B22, type =SPEARMANP(temperature, mortality) and press 
Enter. 

The correlation values are shown in Figure 8-21.


Figure 8-21 
Correlations 
and p value 





The values in Figure 8-21 indicate a strong positive correlation between
the mortality index and the mean annual temperature. The p values for both
measures are also very significant, indicating that this correlation is
statistically different from zero. Note that the p value for Pearson’s r is equal to the
p value for the linear regression shown earlier in Figure 8-11. One more
important point: the value of r, 0.875, is equal to the square root of the R²
statistic, computed earlier in Figure 8-10. It will always be the case that R² is
equal to the square of Pearson’s correlation coefficient between two variables.
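
The link between the regression fit and the square of r can be verified numerically. This is a minimal Python sketch with a hypothetical function name fit_line and made-up data (not the mortality data); it computes the coefficient of determination as 1 minus the ratio of the residual sum of squares to the total sum of squares, which is assumed to be the statistic the regression output reports.

```python
from math import sqrt

def fit_line(x, y):
    """Least-squares line plus Pearson's r and the coefficient of determination."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)
    sse = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    r_squared = 1 - sse / syy              # share of variation explained by the line
    return slope, intercept, r, r_squared

x = [1.0, 2.0, 3.0, 4.0, 5.0]              # hypothetical predictor values
y = [2.1, 3.9, 6.2, 7.8, 10.1]             # hypothetical responses

slope, intercept, r, r_squared = fit_line(x, y)
print(round(r ** 2, 6), round(r_squared, 6))   # the two values agree
```

The agreement holds for any simple regression with a single predictor, not just for this data set.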

You can close the Breast Cancer Regression Analysis workbook now.
You’ve completed your analysis of the data, but you’ll return to it in the
chapter exercises.


Creating a Correlation Matrix 

When you have several variables to study, it’s useful to calculate the
correlations between the variables. In this way, you can get a quick picture of the
relationships between the variables, determining which variables are highly
correlated and which are not. One way of doing this is to create a
correlation matrix, in which the correlations (and associated p values) are laid out
in a square grid.

To illustrate the use of a correlation matrix, consider the Calculus
workbook. This file contains data collected to see how performance in a freshman
calculus class is related to various predictors (Edge and Friedberg, 1984).
Table 8-5 describes the variables in the Calculus workbook.


Table 8-5 Calculus Workbook Variables

Range Name     Range     Description
Calc_HS        A2:A81    Indicates whether calculus was taken in high school
                         (0 = no; 1 = yes)
ACT_Math       B2:B81    The student’s score on the ACT mathematics exam
Alg_Place      C2:C81    The student’s score on the algebra placement exam
                         given in the first week of classes
Alg2_Grade     D2:D81    The student’s grade point in second-year high school
                         algebra
HS_Rank        E2:E81    The student’s percentile rank in high school
Gender         F2:F81    The student’s gender
Gender_Code    G2:G81    The student’s gender code (0 = female; 1 = male)
Calc           H2:H81    The student’s grade in calculus


To open the Calculus workbook: 

1 Open the Calculus workbook from the Chapter08 data folder. 

2 Save the workbook as Calculus Correlation Analysis to the same
folder. The workbook appears as shown in Figure 8-22.

Figure 8-22
Calculus 
workbook 



Now let’s create a matrix of the correlations for all of the variables in the 
workbook. 


To create a correlation matrix of the numeric variables:

1 Click Multivariate Analysis from the StatPlus menu and then click
Correlation Matrix.

2 Click the Data Values button and select all of the variables in the
workbook except the Gender variable.

3 Click the Output button and send the output to a new worksheet
named Corr Matrix. Click the OK button. Figure 8-23 shows the
completed dialog box.

Figure 8-23
Create 
Correlation 
box 


4 Click the OK button. 

Excel generates the matrix of correlations as displayed in Figure 8-24.


Figure 8-24
Correlation
matrix

Figure 8-24 shows two matrices. The first, in cells A1:H9, is the
correlation matrix, which shows the Pearson correlations. The second, the matrix of
probabilities in cells A11:H19, gives the corresponding two-sided p values.
P values less than 0.05 are highlighted in red.

The most interesting numbers here are the correlations with the calculus
score, because the object of the study was to predict this score. The highest
correlation appears in cell E4 (0.491), with Alg Place, the algebra placement
test score. The other correlations, ACT Math, HS Rank, and Calc HS, are not
impressive predictors when you consider that the squared correlation gives R²,
the percentage of variance explained by the variable as a regression predictor.

For example, the correlation between Calc and HS Rank is 0.324 (cell
H6); the square of this is 0.105, so regression on HS Rank would account for
only 10.5% of the variation in the calculus score. Another way of saying this
is that using HS Rank as a predictor reduces the sum of squared errors by
10.5%, as compared with using just the mean calculus score as a predictor.
Note that the p value for this correlation is 0.003 (cell H16), which is less
than 0.05, so the correlation is significant at the 5% significance level.
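
As a rough check on that p value, the standard test of a Pearson correlation uses a t statistic with n - 2 degrees of freedom. The sketch below assumes that all n = 80 rows of the range H2:H81 enter the calculation with no missing values; the sample size actually used is not stated, so treat it as an assumption.

$$t = \frac{r \sqrt{n - 2}}{\sqrt{1 - r^2}} = \frac{0.324 \sqrt{78}}{\sqrt{1 - 0.105}} \approx 3.02$$

A t value of about 3.02 on 78 degrees of freedom is consistent with a two-sided p value of roughly 0.003.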

Just because the taking of high school calculus and the subsequent college
calculus score have a significant correlation, you cannot conclude that taking
calculus in high school causes a better grade in college. The stronger math
students tend to take calculus in high school, and these students also do well
in college. Only if a fair assignment of students to classes could be
guaranteed (so that the students in high school calculus would be no better or worse
than others) could the correlation be interpreted in terms of causation.


Correlation with a Two-Valued Variable 

You might reasonably wonder about using Calc HS here. After all, it assumes 
only the two values 0 and 1. Does the correlation between Calc and Calc HS 
make sense? The positive correlation of 0.324 indicates that if the student 
has taken calculus in high school, the student is more likely to have a high 
calculus grade. 

Another categorical variable in this correlation matrix is Gender Code,
which has a significant negative correlation with the Alg2 Grade (r = -0.446,
p value = 0.000) and HS Rank (r = -0.319, p value = 0.004). Recall that
in the gender code, 0 = female and 1 = male. A negative correlation here
means that females tended to have higher grades in second-year algebra and
were ranked higher in high school.
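
The way a 0/1 variable behaves in a correlation can be seen with a small made-up example. This is a minimal Python sketch; the function name pearson_r and both data lists are hypothetical and are not taken from the Calculus workbook. The sign of r simply reports which of the two groups has the higher mean.

```python
from math import sqrt

def pearson_r(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

indicator = [0, 0, 0, 1, 1, 1]            # hypothetical two-valued variable
score = [60, 70, 65, 80, 75, 85]          # hypothetical scores

mean_0 = sum(s for c, s in zip(indicator, score) if c == 0) / indicator.count(0)
mean_1 = sum(s for c, s in zip(indicator, score) if c == 1) / indicator.count(1)
r = pearson_r(indicator, score)

print(mean_0, mean_1, round(r, 3))        # group coded 1 has the higher mean, so r > 0
```

Swapping the 0 and 1 codes would flip the sign of r without changing its size.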


Adjusting Multiple p Values with Bonferroni 

The second matrix in Figure 8-24 gives the p values for the correlations. 
Except for Gender, all of the correlations with Calc are significant at the 5% 
level because all the p values are less than 0.05. 

Some statisticians believe that the p values should be adjusted for the 
number of tests, because conducting several hypothesis tests raises above 
5% the probability of rejecting at least one true null hypothesis. The 
Bonferroni approach to this problem is to multiply the p value in each test 
by the total number of tests conducted. With this approach, the probability 
of rejecting one or more of the true hypotheses is less than 5%. 

Let’s apply this approach to correlations of Calc with the other variables.
Because there are six correlations, the Bonferroni approach would have us
multiply each p value by 6 (equivalent to decreasing the p value required
for statistical significance to 0.05/6 = 0.0083). Alg2 Grade has a p value of
0.020, and because 6 × 0.020 = 0.120, the correlation is no longer
significant from this point of view. Instead of focusing on the individual correlation
tests, the Bonferroni approach rolls all of the tests into one big package, with
0.05 referring to the whole package.

Bonferroni makes it much harder to achieve significance, and many
researchers are reluctant to use it because it is so conservative. In any case,
this is a controversial area, and professional statisticians argue about it.
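
The adjustment itself is a single multiplication per test. Here is a minimal Python sketch; the function name bonferroni is hypothetical, capping the adjusted value at 1 is a common convention assumed here, and the two p values are the ones quoted in this chapter for Alg2 Grade (0.020) and HS Rank (0.003).

```python
def bonferroni(p_values, n_tests):
    """Multiply each p value by the number of tests, capping the result at 1."""
    return [min(1.0, p * n_tests) for p in p_values]

raw = {"Alg2 Grade": 0.020, "HS Rank": 0.003}
adjusted = dict(zip(raw, bonferroni(list(raw.values()), n_tests=6)))

for name, p in adjusted.items():
    verdict = "significant" if p < 0.05 else "not significant"
    print(name, round(p, 3), verdict)
```

With six tests, the Alg2 Grade correlation loses its significance while the HS Rank correlation keeps it.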

EXCEL TIPS

• To create a correlation matrix with Excel, you can click the
Data Analysis button from the Analysis group on the Data tab
and then click Correlation from the Data Analysis dialog box.
Complete the Correlation dialog box to create the matrix.


Creating a Scatter Plot Matrix 

The Pearson correlation measures the extent of the linear relationship
between two variables. To see whether the relationship between the variables
is really linear, you should create a scatter plot of the two variables. In this
case, that would mean creating 15 different scatter plots, a time-consuming
task! To speed up the process, you can create a scatter plot matrix. In a scatter
plot matrix, or SPLOM, you can create a matrix containing the scatter plots
between the variables. By viewing the matrix, you can tell at a glance the
nature of the relationships between the variables.

To create a scatter plot matrix: 

1 Click Multi-variable Charts from the StatPlus menu and then click
Scatter plot Matrix.

2 Click the Data Values button and select the range names ACT_Math,
Alg_Place, Alg2_Grade, Calc, and HS_Rank.

3 Click the Output button, and send the output to the worksheet
SPLOM. Click the OK button twice.

Excel generates the scatter plot matrix shown in Figure 8-25.

Figure 8-25
Scatter plot
matrix

Depending on the number of variables you are plotting, SPLOMs can be
difficult to view on the screen. If you can’t see the entire SPLOM on your
screen, consider reducing the value in the Zoom Control box. You can also
reduce the SPLOM by selecting it and dragging one of the resizing handles
to make it smaller.

How should you interpret the SPLOM? Each of the five variables is
plotted against the other four variables, with the four plots displayed in a row.
For example, ACT Math is plotted as the y variable against the other four
variables in the first row of the SPLOM. The first plot in the first row is ACT
Math versus Alg Place, and so on. On the other hand, the first plot in the
first column displays Alg Place as the y variable and is plotted against the
x variable, ACT Math. The scales of the plot are not shown in order to save
space. If you find a plot of interest, you can recreate it using Excel’s Chart
Wizard to show more details and information.

Carefully consider the plots in the second to last row, which show Calc
against the other variables. Each plot shows a roughly linear upward trend.
It would be reasonable to conclude here that correlation and linear
regression are appropriate when predicting Calc from ACT Math, Alg2 Grade, Alg
Place, and HS Rank.

Recall from Figure 8-24 that Alg Place had the highest correlation with 
Calc. How is that evident here? A good predictor has good accuracy, which 
means that the range of y is small for each x. Of the four plots in the fourth 
row, the plot of Calc against Alg Place has the narrowest range of y values 
for each x. However, Alg Place is the best of a weak lot. None of the plots 
shows that really accurate prediction is possible. None of these plots shows
a relationship anywhere near as strong as the relationship between mortality
index and temperature that you worked with earlier in the chapter.

Save your work and close the Calculus Correlation Analysis workbook.


Exercises 

1. True or false, and why: If the slope of a
regression line is large, the correlation
between the variables will also be large.

2. True or false, and why: If the correlation
between two variables is near 1, the
slope will be a large positive number.

3. True or false, and why: If the p value of
the Pearson’s correlation coefficient is
low, the p value of the slope parameter of
the regression equation will also be low.

4. True or false, and why: A correlation of
zero means that the two variables are
unrelated.

5. True or false, and why: The runs test is
one of the diagnostic tests you should
always apply to the residuals in your
regression analysis.

6. In a time-ordered study, you have 25
residuals from the regression model.
There are 10 negative residuals and
15 positive ones. There are a total of
10 runs. Is this an unusual number of
runs? What is the level of statistical
significance?

7. Using the following ANOVA table for
the regression of variable y on variable x,
answer the questions below.

Table 8-6 Regression of Variable y on x

ANOVA         df    SS       MS       F       Significance F
Regression    1     129.6    129.6    4.91    0.057
Residual      8     210.9    26.4
Total         9     340.5

a. How many observations are in the
data set?

b. What is the variance of y?

c. What is the value of R²?

d. What percentage of the variability in y
is explained by the regression?

e. What is the absolute value of the
correlation of x and y?

f. What is the p value of the correlation
of x and y?

g. What is the standard error (the typical
deviation of an observed point from
the regression line)?


8. Return to the Breast Cancer Mortality
study discussed in this chapter. There
may be an outlier in the data set.
Perform the following analysis to
determine the effect of this outlier on
the regression analysis:

a. Open the Breast Cancer workbook
from the Chapter08 folder and save it
as Breast Cancer Outlier Regression
to the same folder.

b. Remove the observation for the first
region from the data set.

c. Create a scatter plot of mortality
versus temperature and add a linear
trend line to the plot, showing both the
R² value and the regression equation.

d. Calculate the regression statistics for
the new data set.

e. Create scatter plots of the residuals of
the regression equation versus
temperature and the predicted values.
Also create a normal probability plot
of the residuals.

f. Calculate the Pearson and Spearman
correlation coefficients, including the
p values.

g. Save your changes to the workbook
and write a report summarizing your
results, including a description of the
diagnostic tests you performed. How
does this regression compare with the
regression you performed earlier that
included the possible outlier? How do
the diagnostic plots compare?

9. You’ve been given an Excel workbook
containing nutritional information on 10
wheat products. Perform the following
analysis:

a. Open the Wheat workbook from the
Chapter08 folder and save it as Wheat
Regression Analysis.

b. Plot Calories versus ServingGrams,
adding a regression line, equation,
and R² value to the plot. How does
the serving size (in grams) predict
the calories of the different wheat
products?

c. Compute Pearson’s correlation and
the corresponding p value between
Calories and ServingGrams.

d. Use the Data Analysis ToolPak to
calculate the statistics for the regression
equation.

e. Create diagnostic plots of residuals
versus ServingGrams, and the normal
probability plot of the residuals. Do
the regression assumptions seem to be
satisfied?

f. In the plot of residuals versus
predicted values, label each point with
the food type (pretzel, bagel, bread,
etc.). Where do the residuals for the
breads appear?

g. Breads are often low in calories
because of high moisture content. One
way of removing the moisture content
from the equation is to create a new
variable that sums up the total of the
nutrient weights. With this in mind,
create a new variable, Total, which
is the sum of the weights of
carbohydrates, proteins, and fats. From this
total subtract the value of the Fiber
variable since fiber does not contribute
to the calorie total. Plot Calories
versus Total on a new chart sheet.

h. Redo your regression equation,
regressing the Calories variable on the
new variable Total. How does the
diagnostic plot of residuals versus
predicted values compare to the earlier
plot? How do the R² values compare?
Where are the residuals for the bread
values located?

i. Save your changes to the workbook
and write a report summarizing your
observations.

10. Continue to investigate the nutritional
data in the Wheat workbook by
performing the following analysis:

a. Open the Wheat workbook from the
Chapter08 folder and save it as Wheat
Correlation Matrix.

b. Create scatter plot and correlation
matrices (Pearson correlation only) for
the variables serving grams, calories,
protein, carbohydrate, and fat.

c. Why is fat so weakly related to the
other variables? Given that fat is
supposed to be very important in
calories, why is the correlation so
weak here?

d. Would the relationship between the
fat and calories variables be stronger
if we used foods that cover a wider
range of fat-content values?

e. Save your changes to the workbook
and write a report summarizing your
observations.

11. You’ve been given a workbook
containing the ages and prices of used Mustangs
from Cars.com in 2002. Perform the
following analysis:

a. Open the Mustang workbook from
the Chapter08 folder and save it as
Mustang Regression Analysis.

b. Compute the Pearson and Spearman
correlations (and p values) between
age and price.

c. Plot price against age. Does this
scatter plot cause you any concern about
the validity of the correlations?

d. How do the correlations change if you
concentrate only on cars that are less
than 20 years old?

e. Excluding the old classic cars (older
than 19 years), perform a regression of
price against age and find the drop in
price per year of age.

f. Do you see any problems in the
diagnostic plots of the residuals?

g. Save your changes to the workbook
and write a report summarizing your
observations.

12. Return to the Calculus data set you
examined in this chapter and perform the
following analysis:

a. Open the Calculus workbook from
the Chapter08 folder and save it as
Calculus Regression Analysis.

b. Regress Calc on Alg Place and obtain
a 95% confidence interval for the
slope.

c. Interpret the slope in terms of the
increase in final grade when the
placement score increases by 1 point.

d. Do the residuals give you any cause
for concern about the validity of the
model?

e. Save your changes to the workbook
and write a report summarizing your
observations.

13. The Booth workbook gives total assets
and net income for 45 of the largest
U.S. banks in 1973. Open
the workbook and perform the
following analysis of this historical
economic data set:

a. Open the Booth workbook from the
Chapter08 folder and save it as Booth
Regression Analysis.

b. Plot net income against total assets
and notice that the points tend to
bunch up toward the lower left, with
just a few big banks dominating the
upper part of the graph. Add a linear
trend line to the plot.

c. Regress net income against total
assets and plot the standardized residuals
against the predictor values. (The
standardized residuals appear with
the regression output when you select
the Standardized Residuals check box
in the Regression dialog box.)

d. Given that the residuals tend to be
bigger for the big banks, you should
be concerned about the assumption
of constant variance. Try taking
logs of both variables. Now repeat
the plot of one against the other,
repeat the regression, and again look
at the plot of the residuals against
the predicted values. Does the
transformation help the relationship?
Is there now less reason to be
concerned about the assumptions?
Notice that some banks have strongly
positive residuals, indicating good
performance, and some banks have
strongly negative residuals,
indicating below-par performance. Indeed,
bank 20, Franklin National Bank, has
the second most negative residual
and failed the following year. Booth
(1985) suggests that regression is a
good way to locate problem banks
before it is too late.

e. Save your changes to the workbook
and write a report summarizing your
observations.

14. You’ve been given a workbook which 
contains mass and volume measure¬ 
ments on eight chunks of aluminum 
from a high school chemistry class. 

a. Open the Aluminum workbook from
the Chapter08 folder and save it as
Aluminum Regression Analysis.

b. Plot mass against volume, and notice 
the outlier. 

c. After excluding the outlier, regress 
mass on volume, without the con¬ 
stant term (select the Constant is Zero 
checkbox in the Regression dialog 
box), because the mass should be 0 
when the volume is 0. The slope of 
the regression line is an estimate of 
the density (not a statistical word here 
but a measure of how dense the metal 
is) of aluminum. 

d. Give a 95% confidence interval for 
the true density. Does your interval 
include the accepted true value, 
which is 2.699? 

e. Save your changes to the workbook 
and write a report summarizing your 
observations. 

15. You’ve been given data containing 
health statistics from 2007 for the 50 
states of the United States. The data 
set contains two variables: Diabetes 
and FluPneum. The Diabetes variable 
contains the death rates (per 100,000) 
for diabetes while the FluPneum vari¬ 
able contains the death rates for causes 
related to the flu or pneumonia. You’ve 
been asked to determine if there is any 
correlation between these two measures. 

a. Open the Health workbook from the
Chapter08 folder and save it as Health
Correlation Analysis.


b. Compute the Pearson correlation and 
the Spearman rank correlation 
between them. How does the 
Spearman rank correlation differ 
from the Pearson correlation? How 
do the p values compare? Are both 
tests significant at the 5% level? 

c. Create the corresponding scatter plot. 
Label each point on the scatter plot 
with the name of the state. Which 
state is a possible outlier on the lower 
left of the plot? 

d. Copy the data to a new worksheet, 
removing the most extreme outlier. 
Redo the correlations and your scatter 
plot. 

e. How are the size and significance of
the correlations influenced by removing
that one state? Make a case for the
deletions on the basis of the plot and
some geography. Does the original
correlation give an exaggerated notion
of the relationship between the
two variables? Does the nonparametric
correlation coefficient solve the
problem? Explain. Would you say that
a correlation without a plot can be
deceiving?

f. Save your workbook and write a re¬ 
port summarizing your observations. 

16. The Fidelity workbook contains finan¬ 
cial data from 1989, 1990, and 1991 for 
33 Fidelity sector funds. The source is 
the Morningside Mutual Fund Source- 
book 1992, Equity Mutual Funds. 

You’ve been asked to explore the rela¬ 
tionships between some of the financial 
variables in this data set. The name of 
the fund is given in the Sector column. 
The TOTL90 column is the percentage 
total return during the year 1990, and 
TOTL91 is the percentage total return 
for the year 1991. NAV90 is the percent¬ 
age increase in net asset value during 
1990, and similarly, NAV91 is the 
percentage change in net asset value 
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during 1991. INC90 is the percentage 
net income for 1990, and similarly, 
INC91 is the percentage net income for 
1991. CAPRET90 is the percentage capi¬ 
tal gain for 1990, and CAPRET91 is the 
percentage capital gain for 1991. 

a. Open the Fidelity workbook from 
the Chapter08 folder and save it as 
Fidelity Financial Analysis. 

b. What is the correlation between the 
percentage capital gains for 1990 and 
1991? Do your analysis using both the 
Pearson and Spearman correlations, 
calculating the p value for both. Is 
there evidence to support the sup¬ 
position that the percentage capital 
gains from 1990 are highly correlated 
with the percent capital gains for 
1991? 

c. What is the correlation between the 
percentage net income for 1990 and 
1991? Use both the Pearson and 
Spearman correlation coefficients and 
include the p values. Is net income 
from 1990 highly correlated with net 
income from 1991? 

d. Create a scatter plot for the two 
correlations in parts b and c. Label each 
point on the scatter plot with labels 
from the Sector column. 

e. You should get a stronger correlation 
for income than for capital gains. How 
do you explain this? 

f. Calculate the correlation between the 
percentage increase in net asset value 
in 1990 to 1991 using the NAV90 and 
NAV91 variables and then generate 
the scatter plot, labeling the points 
with the sector names. Note that the 
Biotechnology Fund stands out in the 
plot. It was the only fund that per¬ 
formed well in both years. 

g. Compare the Pearson and Spearman 
correlation values for NAV90 and 
NAV91. Are they the same sign? What 
could account for the different 
correlation values? Which do you think is 
more representative of the scatter plot 
you created? 

h. If the correlation is this weak, what 
does it suggest about using fund per¬ 
formance in one year as a guide to 
fund performance in the following 
year? 

i. Save your changes to the workbook 
and write a report summarizing your 
observations. 

17. The Draft workbook contains information 
on the 1970 military draft lottery. Draft 
numbers were determined by placing 
all 366 possible birth dates in a rotating 
drum and selecting them one by one. The 
first birth date drawn received a draft 
number of 1 and men born on that date 
were drafted first, the second birth date 
entered received a draft number of 2, and 
so forth. Is there any relationship between 
the draft number and the birth date? 

a. Open the Draft workbook from the 
Chapter08 folder and save it as Draft 
Correlation Analysis. 

b. Using the values in the Draft Numbers 
worksheet, calculate the Pearson and 
Spearman correlation coefficients and 
p value between the Day_of_the_Year 
and the Draft number. Is there a sig¬ 
nificant correlation between the two? 
Using the value of the correlation, 
would you expect higher draft num¬ 
bers to be assigned to people born ear¬ 
lier in the year or later? 

c. Create a scatter plot of Number versus 
Day_of_the_Year. Is there an obvious 
relationship between the two in the 
scatter plot? 

d. Add a trend line to your scatter plot 
and include both the regression 
equation and the $R^2$ value. How much 
of the variation in draft number is 
explained by the Day_of_the_Year 
variable? 

e. Calculate the average draft number 
for each month and then calculate 
the correlation between the month 
number and the average draft number. 
How do the values of the correlation 
in this analysis compare with those of 
the correlation you performed earlier? 

f. Create a scatter plot of average draft 
number versus month number. Add a 
trend line and include the regression 
equation and $R^2$ value. How much 
of the variability in the average draft 
number per month is explained by the 
month? 

g. Save your changes to the workbook 
and write a report summarizing your 
conclusions. Which analysis (looking 
at daily values or looking at monthly 
averages) better describes any 
problem with the draft lottery? 

18. The Emerald health care providers 
claim that components of their health 
plan cause its costs to rise significantly 
more slowly than overall health costs. You 
decide to investigate to see whether 
there is evidence for Emerald’s claim. 
You have recorded Emerald costs over 
the past seven years, along with the 
consumer price index (CPI) for all urban 
consumers and the medical component 
of the CPI. 

a. Open the Emerald workbook from 
the Chapter08 folder and save it as 
Emerald Regression Analysis. 

b. Using the Analysis ToolPak’s 
Regression command, calculate the 
regression equation for each of the three 
price indexes against the year 
variable. What are the values for the three 
slopes? Express the slope in terms 
of the increase in the index per year. 
How does Emerald’s change in cost 
compare to the other two indexes? 

c. Look at the 95% confidence intervals 
for the three slopes. Do the confidence 
intervals overlap? Does there appear 
to be a significant difference in the 
rate of increase under the Emerald 
plan as compared to the increases 
under the other two indexes? 

d. Summarize your conclusions. Do you 
see evidence to substantiate Emerald’s 
claim? 

e. Save your changes and write a report 
summarizing your observations. 

19. The Teacher workbook contains data on 
the relationship between teachers’ 
salaries and the spending on public schools 
per pupil in 1985. Perform the following 
analysis on this data set: 

a. Open the Teacher workbook from 
the Chapter08 folder and save it as 
Teacher Salary Analysis. 

b. Create a scatter plot of spending per 
pupil versus teacher salary. Add a 
trend line containing the $R^2$ value and 
regression equation to the plot. 

c. Compute the regression statistics for 
the data, and then create the 
diagnostic plots discussed in this chapter. Is 
there any evidence of a problem in 
the diagnostic plots? 

d. Copy the spending per pupil versus 
teacher salary scatter plot to a new 
chart sheet and then break down the 
points in the plot on the basis of the 
values of the area variable. For each 
of the three series in the chart, add a 
linear trend line and compute the 
$R^2$ value and regression equation. How 
do the least-squares lines compare 
among the three regions? What do you 
think accounts for any difference in 
the trend lines? 

e. Redo the regression statistics, 
performing three regressions, one for 
each of the three areas in the data set. 
Compare the regression equations. 
What are the 95% confidence 
intervals for the slope parameters in the 
three areas? 

f. Save your changes to the workbook 
and write a report summarizing your 
observations. 

20. The Highway workbook contains data 
on highway fatalities per million vehicle 
miles from 1945 to 1984 for the United 
States and the state of New Mexico. 
You’ve been asked to use regression 
analysis to analyze and compare the 
trend in the fatality rates. 

a. Open the Highway workbook from 
the Chapter08 folder and save it as 
Highway Regression Analysis. 

b. Create a scatter plot that shows the 
New Mexico and U.S. fatality rates 
versus the Year variable. For each 
data series, display the linear 
regression line, along with the regression 
equation and $R^2$ value. How much of 
the variation in highway fatalities is 
explained by the linear regression line 
for the two data sets? Do the trend 
lines appear to be the same? What 
problems would you see for this trend 
line if it is extended out for many 
years into the future? 

c. Calculate the regression statistics for 
both data sets and create residual 
plots for both regressions. Do the 
residual plots indicate any possible 
violations of the regression assumptions? 

d. Since these are time-ordered data, 
perform a runs test on the standardized 
residuals for both the New Mexico 
and U.S. data. Calculate the 
Durbin-Watson test statistic for both sets of 
residuals. Does your analysis lead you 
to believe that one of the regression 
assumptions has been violated? 

e. Save your changes to the workbook 
and write a report summarizing your 
conclusions. 


21. The HomeTax workbook contains data 
on home prices and property taxes for 
houses in Albuquerque, New Mexico, 
sold back in 1993. Many factors were 
involved in assessing the property tax 
for a home during that time. You’ve been 
asked to do a general analysis compar¬ 
ing the price of the home to its assessed 
property tax. 

a. Open the HomeTax workbook from 
the Chapter08 folder and save it as 
HomeTax Regression Analysis. 

b. Create a scatter plot of the tax on each 
home versus that home’s price. Add 
a trend line to the scatter plot and 
include the regression equation and 
$R^2$ value. How much of the variation 
in property taxes is explained by the 
price of the house? 

c. Calculate the regression statistics, 
comparing property tax to home 
price, and create a plot of the 
residuals. 

d. Create a Normal plot of the residuals. 
Is there anything in the two residual 
plots that may violate the regression 
assumptions? 

e. Create two new variables in the 
workbook named log(price) and log(tax) 
that contain the base 10 logarithms 
of the price and tax data. Redo steps 
b through d on these transformed 
data. Has the transformation solved any 
problems with the regression assump¬ 
tions on the untransformed values? 
What problems, if any, still remain? 

f. Save your changes to the workbook 
and write a report summarizing your 
conclusions. 


Chapter 9 


Multiple Regression 



Objectives 

In this chapter you will learn to: 

• Use the F distribution 
• Fit a multiple regression equation and interpret the results 
• Use plots to help understand a regression relationship 
• Validate a regression using residual diagnostics 


Regression Models with Multiple Parameters 

In Chapter 8, you used simple linear regression to predict a dependent 
variable ($y$) from a single independent variable ($x$, a predictor variable). In 
multiple regression, you predict a dependent variable from several independent 
variables. For three predictors, $x_1$, $x_2$, and $x_3$, the multiple regression model 
takes the form: 

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon$$

where the $\beta$ coefficients are unknown parameter values that you can estimate 
and $\varepsilon$ is random error, which follows a normal distribution with mean 0 and 
variance $\sigma^2$. Note that the predictors can also be functions of variables. The 
following are also examples of models whose parameters you can estimate 
with multiple regression: 

Polynomial: $y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \varepsilon$

Trigonometric: $y = \beta_0 + \beta_1 \sin x + \beta_2 \cos x + \varepsilon$

Logarithmic: $y = \beta_0 + \beta_1 \log x_1 + \beta_2 \log x_2 + \varepsilon$

Note that all of these equations are examples of linear models, even 
though they use various trigonometric and logarithmic functions. The linear 
in linear model refers to the error term $\varepsilon$ and the parameters $\beta$. The 
equations are linear in those terms. For example, one could create new variables 
$l = \sin x$ and $k = \cos x$, and then the second model is the linear equation 
$y = \beta_0 + \beta_1 l + \beta_2 k + \varepsilon$. 

After computing estimated values for the $\beta$ coefficients, you can plug 
them into the equation to get predicted values for $y$. The estimated 
regression model is expressed as 

$$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3$$

where the $b$’s are the estimated parameter values, and the residuals 
correspond to the error term $\varepsilon$. 
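
The point that a trigonometric model is still linear in its parameters can be checked numerically. The following is a minimal sketch, not part of the Excel workflow in this book: it builds made-up, noise-free data from y = 2 + 3 sin x - 1.5 cos x, creates the two new predictor columns, and recovers the coefficients by ordinary least squares. The helper names `solve` and `fit_linear_model` are hypothetical, and the normal-equations approach is chosen only because it needs nothing beyond the Python standard library.

```python
import math

def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_linear_model(columns, y):
    """Least-squares estimates [b0, b1, ...] for y = b0 + b1*col1 + b2*col2 + ..."""
    rows = [[1.0] + list(vals) for vals in zip(*columns)]  # leading 1 for the intercept
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve(xtx, xty)

# Made-up, noise-free data (illustration only, not from any workbook).
xs = [0.1 * i for i in range(60)]
ys = [2 + 3 * math.sin(x) - 1.5 * math.cos(x) for x in xs]

sin_x = [math.sin(x) for x in xs]  # first new predictor
cos_x = [math.cos(x) for x in xs]  # second new predictor
b0, b1, b2 = fit_linear_model([sin_x, cos_x], ys)
print(round(b0, 6), round(b1, 6), round(b2, 6))
```

Because the data contain no error term, the estimates reproduce the generating coefficients up to rounding; with real data the residuals would absorb the random error.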



CONCEPT TUTORIALS 

The F distribution 


The F distribution is basic to regression and analysis of variance as studied 
in this chapter and the next. An example of the F distribution is shown in 
the Distributions workbook. 


To view the F distribution: 

1 Open the Distributions workbook located in the Explore folder of 
your Student files. Enable the macros in the workbook. 

2 Click F from the Table of Contents column. Review the material and 
scroll to the bottom of the worksheet. See Figure 9-1. 


Figure 9-1 
The F 
distribution 

The F distribution has two degree-of-freedom parameters: the numerator 
and denominator degrees of freedom. The distribution is usually referred to 
as F(m, n), that is, the F distribution with m numerator degrees of freedom 
and n denominator degrees of freedom. The Distributions workbook opens 
with an F(4, 9) distribution. 

Like the $\chi^2$ distribution, the F distribution is skewed. To help you better 
understand the shape of the F distribution, the worksheet lets you vary the 
degrees of freedom of the numerator and the denominator by clicking the 
degrees-of-freedom scroll arrows. Experiment with the worksheet to view 
how the distribution of F changes as you increase the degrees of freedom. 

To increase the degrees of freedom in the numerator and 
denominator: 

1 Click the up spin arrow to increase the numerator degrees of freedom 
to 10. 

2 Click the up spin arrow to increase the denominator degrees of freedom 
to 15. Then watch how the distribution changes. 


In this book, hypothesis tests based on the F distribution always use the 
area under the upper tail of the distribution to determine the p value. 

To change the p value: 

1 Click the Critical Value box, type 0.10, and then press Enter. This 
gives you the critical value for the F test at the 10% significance 
level. 


Notice that the critical value shifts to the left, telling you that 10% of the 
values of the F distribution lie to the right of this point. 

Continue working with the F distribution worksheet, trying different 
parameter values to get a feel for the F distribution. Close the workbook 
when you’re finished. You do not need to save any changes you may have 
inadvertently made to the document. 


Using Regression for Prediction 

One of the goals of regression is prediction. For example, you could use 
regression to predict what grade a student would get in a college calculus 
course. (This is the dependent variable, the one being predicted.) The pre¬ 
dictors (the independent variables) might be ACT or SAT math score, high 
school rank, and a placement test score from the first week of class. Students 
with low predictions might be asked to take a lower-level class. 

However, suppose the dependent variable is the price of a four-unit apart¬ 
ment building and the independent variables are the square footage, the age of 
the building, the total current rent, and a measure of the condition of the build¬ 
ing. Here you might use the predictions to find a building that is undervalued, 
with a price that is much less than its prediction. This analysis was actually 
carried out by some students, who found that there was a bargain building 
available. The owner needed to sell quickly as a result of cash flow problems. 

You can use multiple regression to see how several variables combine 
to predict the dependent variable. How much of the variability in the de¬ 
pendent variable is accounted for by the predictors? Do the combined in¬ 
dependent variables do better or worse than you might expect, on the basis 
of their individual correlations with the dependent variable? You might be 
interested in the individual coefficients and in whether they seem to mat¬ 
ter in the prediction equation. Could you eliminate some of the predictors 
without losing much prediction ability? 

When you use regression in this way, the individual coefficients are impor¬ 
tant. Rosner and Woods (1988) compiled statistics from baseball box scores, and 
they regressed runs on singles, doubles, triples, home runs, and walks (walks 
are combined with hit by pitched ball). Their estimated prediction equation is 

Runs = -2.49 + 0.47 singles + 0.76 doubles + 1.14 triples + 1.54 home runs + 0.39 walks 

Notice that walks have a coefficient of 0.39, and singles have a 
coefficient of 0.47, so a walk has more than 80% of the weight of a single. This 
is in contrast to the popular slugging percentage used to measure the 
offensive production of players, which gives weight 0 to walks, 1 to singles, 2 to 
doubles, 3 to triples, and 4 to home runs. The Rosner-Woods equation gives 
relatively more weight to singles, and the weight for doubles is less than 
twice as much as the weight for singles. Similar comparisons are true for 
triples and home runs. Do baseball general managers use equations like the 
Rosner-Woods equation to evaluate ball players? If not, why not? 
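
The weight comparison in this paragraph is simple arithmetic, and a short sketch makes it concrete. The code below only restates the coefficients quoted above and the slugging-percentage weights; the function name `relative_to_single` is a hypothetical helper, and expressing each weight as a multiple of a single's weight is just one convenient way to compare the two scales.

```python
# Coefficients from the Rosner-Woods equation quoted in the text.
rosner_woods = {"walk": 0.39, "single": 0.47, "double": 0.76,
                "triple": 1.14, "home run": 1.54}
# Weights implied by slugging percentage, as described in the text.
slugging = {"walk": 0, "single": 1, "double": 2, "triple": 3, "home run": 4}

def relative_to_single(weights):
    """Express every weight as a multiple of the weight given to a single."""
    return {event: w / weights["single"] for event, w in weights.items()}

rw_rel = relative_to_single(rosner_woods)
slg_rel = relative_to_single(slugging)

for event in rosner_woods:
    print(f"{event:9s} Rosner-Woods {rw_rel[event]:.2f}   slugging {slg_rel[event]:.2f}")
```

In the printed table every extra-base hit is worth fewer singles under the regression than under slugging percentage, while a walk is worth most of a single rather than nothing.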

You can also use regression to see whether a particular group is being 
discriminated against. A company might ask whether women are paid less 
than men with comparable jobs. You can include a term in a regression to 
account for the effect of gender. Alternatively, you can fit a regression model 
for just men, apply the model to women, and see whether women have sala¬ 
ries that are less than would be predicted for men with comparable posi¬ 
tions. It is now common for such arguments to be offered as evidence in 
court, and many statisticians have experience in legal proceedings. 


Regression Example: Predicting Grades 

For a detailed example of a multiple regression, consider the Calculus 
workbook first discussed in Chapter 8, which examined how scores in first- 
semester calculus were related to various measures of student achievement 
in high school (Edge and Friedberg, 1984). 

To open the Calculus workbook: 

1 Start Excel and open the Calculus workbook from the Chapter09 
data folder. 

2 Save the file as Calculus Multiple Regression. 


In Chapter 8, it appeared from the correlation matrix and scatter plot ma¬ 
trix that the algebra placement test is the best individual predictor of the 
first semester calculus score (although it is not very successful). Multiple re¬ 
gression gives a measure of how good the predictors are when used together. 
The model is 

$$\text{Calculus score} = \beta_0 + \beta_1(\text{Calc HS}) + \beta_2(\text{ACT Math}) + \beta_3(\text{Alg Place}) + \beta_4(\text{Alg2 Grade}) + \beta_5(\text{HS Rank}) + \beta_6(\text{Gender Code}) + \varepsilon$$

You can use the Analysis ToolPak Regression command to perform a 
multiple regression on the data, but the predictor variables must occupy a 
contiguous range. You will be using columns A, B, C, D, E, and G as your 
predictor variables, so you need to move column G, Gender Code, next to 
columns A:E. 

To move column G next to columns A:E: 

1 Click the G column header to select the entire column. 

2 Right-click the selection to open the shortcut menu; then click Cut. 

3 Click the F column header. 

4 Right-click the selection to open the shortcut menu; then click Insert 
Cut Cells. You can now identify the contiguous range of columns 
A:F as your predictor variables. 


To perform a multiple regression on the calculus score based on the pre¬ 
dictor variables Calc HS, ACT Math, Alg Place, Alg2 Grade, HS Rank, and 
Gender Code, use the Regression command found in the Analysis ToolPak 
provided with Excel. 

To perform the multiple regression: 

1 Click Data Analysis from the Analysis group on the Data tab, select 
Regression from the Analysis Tools list box, and click OK. 

2 Type H1:H81 in the Input Y Range text box, press Tab, and then type 
A1:F81 in the Input X Range text box. 

3 Click the Labels checkbox and the Confidence Level checkbox to 
select them, and then verify that the Confidence Level box contains 95. 

4 Click the New Worksheet Ply option button, click the corresponding 
text box, and then type Multiple Reg. 

5 Click the Residuals, Standardized Residuals, Residual Plots, and 
Line Fit Plots checkboxes to select them. Your Regression dialog box 
should look like Figure 9-2. 

6 Click OK. 


Excel creates a new sheet, Multiple Reg, which contains the summary 
output and the residual plots. 


Interpreting the Regression Output 

To interpret the output, look first at the analysis of variance (ANOVA) table 
found in cells A10:F14. Figure 9-3 shows this range with the columns wid¬ 
ened to display the labels and the values reformatted. The analysis of vari¬ 
ance table shows you whether the fitted regression model is significant. 

The analysis of variance table helps you choose between two hypotheses. 

$H_0$: The population coefficients of all six predictor variables = 0 
$H_a$: At least one of the six population coefficients $\neq$ 0 

Figure 9-3 
Multiple 
regression 
ANOVA table 

There are many different parts to an ANOVA table. At this point, you 
should just concentrate on the F ratio and its p value, which tell you whether 
the regression is significant. This ratio is large when the predictor variables 
explain much of the variability of the response variable, and hence it has a 
small p value as measured by the F distribution. A small value for this ratio 
indicates that much of the variability in y is due to random error (as esti¬ 
mated by the residuals of the model) and is not due to the regression. The 
next chapter, on analysis of variance, contains a more detailed description 
of the ANOVA table. 

The F ratio, 7.197, is located in cell E12. Under the null hypothesis, you 
assume that there is no relationship between the six predictors and the cal¬ 
culus score. If the null hypothesis is true, the F ratio in the ANOVA table 
follows the F distribution, with 6 numerator degrees of freedom and 73 de¬ 
nominator degrees of freedom. You can test the null hypothesis by seeing 
whether this observed F ratio is much larger than you would expect in the 
F distribution. If you want to get a visual picture of this hypothesis test, use 
the F distribution worksheet from the Distributions workbook and display 
the F(6, 73) distribution. 

The Significance F column gives a p value of $4.69 \times 10^{-6}$ (cell F12), 
representing the probability that an F ratio with 6 degrees of freedom in the 
numerator and 73 in the denominator has a value 7.197 or more. This is 
much less than .05, so the regression is significant at the 5% level. You 
could also say that you reject the null hypothesis at the 5% level and ac¬ 
cept the alternative that at least one of the coefficients in the regression is 
not zero. If the F ratio were not significant, there would not be much inter¬ 
est in looking at the rest of the output. 
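
The upper-tail probability behind the Significance F value can be reproduced outside Excel. The sketch below is an illustration under a stated restriction: it uses a closed-form expression for the F distribution's upper tail that is valid only when the numerator degrees of freedom are even, which happens to hold here (6). The function name `f_upper_tail` is hypothetical; in a worksheet you would normally rely on Excel's own F distribution functions instead.

```python
def f_upper_tail(f, d1, d2):
    """P(F > f) for an F(d1, d2) distribution; this closed form needs d1 even.

    Uses P(F > f) = I_x(d2/2, d1/2) with x = d2 / (d2 + d1*f), where for an
    integer second argument b the regularized incomplete beta function is
    I_x(a, b) = x**a * sum_{j=0}^{b-1} C(a+j-1, j) * (1-x)**j.
    """
    if d1 % 2 != 0:
        raise ValueError("this sketch only handles even numerator degrees of freedom")
    a = d2 / 2
    b = d1 // 2
    x = d2 / (d2 + d1 * f)
    term = 1.0   # j = 0 term of the sum
    total = 0.0
    for j in range(b):
        total += term
        term *= (a + j) / (j + 1) * (1 - x)  # step from term j to term j + 1
    return x ** a * total

# F ratio and degrees of freedom reported in the ANOVA table.
p_value = f_upper_tail(7.197, 6, 73)
print(f"{p_value:.3g}")
```

The result is on the order of a few millionths, far below .05, which is why the regression is declared significant at the 5% level.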


Multiple Correlation 

The regression statistics appear in the range A3:B8, shown in Figure 9-4 
(formatted to show column labels and the values to three decimal places). 

Figure 9-4 
Multiple 
regression 
statistics 

The R Square value in cell B5 (.372) is the coefficient of determination 
$R^2$ discussed in the previous chapter. This value indicates that 37% of the 
variance in calculus scores can be attributed to the regression. In other 
words, 37% of the variability in the final calculus score is due to 
differences among students (as quantified by the values of the predictor 
variables) and the rest is due to random fluctuation. Although this value might 
seem low, it is an unfortunate fact that decisions are often made on the 
basis of weak predictor variables, including decisions about college 
admissions and scholarships, freshman eligibility in sports, and placement 
in college classes. 

The Multiple R (0.610) in cell B4 is just the square root of the $R^2$; this 
is also known as the multiple correlation. It is the correlation between 
the response variable, the calculus score, and the linear combination of 
the predictor variables as expressed by the regression. If there were only 
one predictor, this would be the absolute value of the correlation between 
the predictor and the dependent variable. The Adjusted R Square value in 
cell B6 (0.320) attempts to adjust the $R^2$ for the number of predictors. You 
look at the adjusted $R^2$ because the unadjusted $R^2$ value either increases or 
stays the same when you add predictors to the model. If you add enough 
predictors to the model, you can reach some very high $R^2$ values, but not 
much is to be gained by analyzing a data set with 200 observations if the 
regression model has 200 predictors, even if the $R^2$ value is 100%. Adjusting 
the $R^2$ compensates for this effect and helps you determine whether adding 
additional predictors is worthwhile. 
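
The relationships among these summary statistics can be verified with a few lines of arithmetic. The sketch below starts from the rounded R Square value (.372), the number of cases (80), and the number of predictors (6). The adjusted R Square formula used here is the standard textbook one and is an assumption in the sense that the text above does not spell it out; the F ratio line likewise uses the usual identity linking F to R Square. Because the input is rounded to three decimals, the results agree with the worksheet only to about two or three digits.

```python
import math

n = 80             # number of cases
p = 6              # number of predictors
r_squared = 0.372  # R Square as displayed (rounded)

# Multiple R is the square root of R Square.
multiple_r = math.sqrt(r_squared)

# Standard adjustment for the number of predictors (assumed formula).
adjusted_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

# F ratio expressed through R Square: explained variance per predictor
# divided by unexplained variance per residual degree of freedom.
f_ratio = (r_squared / p) / ((1 - r_squared) / (n - p - 1))

print(f"Multiple R         {multiple_r:.3f}")
print(f"Adjusted R Square  {adjusted_r_squared:.3f}")
print(f"F ratio            {f_ratio:.2f}")
```

The small gap between the F ratio computed here and the 7.197 in the ANOVA table comes from starting with a rounded R Square.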

The standard error value, 9.430 (cell B7), is the estimated value of $\sigma$, the 
standard deviation of the error term $\varepsilon$; in other words, the standard 
deviation of the calculus score once you compensate for differences in the 
predictor variables. You can also think of the standard error as the typical error 
for prediction of the 80 calculus scores. Because a span of 10 points 
corresponds to a difference of one letter grade (A vs. B, B vs. C, and so on), the 
typical error of prediction is about one letter grade. 


Coefficients and the Prediction Equation 


At this point you know the model is statistically significant and accounts for 
about 37% of the variability in calculus scores. What is the regression 
equation itself and which predictor variables are most important? 

You can read the estimated regression model from cells A16:I23, shown 
in Figure 9-5, where the first column contains labels for the predictor 
variables. 


Figure 9-5 
Parameter 
estimates 
and p values 



The Coefficients column (B16:B23) gives the estimated coefficients for the 
model. The corresponding prediction equation is 

Calc = 27.943 + 7.192(Calc HS) + 0.352(ACT Math) + 0.827(Alg Place) 

+ 3.683(Alg2 Grade) + 0.111(HS Rank) + 2.627(Gender Code) 

The coefficient for each variable estimates how much the calculus score 
will change if the variable is increased by 1 and the other variables are held 
constant. For example, the coefficient 0.352 of ACT Math indicates that the 
calculus score should increase by 0.352 point if the ACT math score in¬ 
creases by 1 point and all other variables are held constant. 

Some variables, such as Calc HS, have a value of either 0 or 1, in this 
case to indicate the absence or presence of calculus in high school. The co¬ 
efficient 7.192 is the estimated effect on the calculus score of taking high 
school calculus, other things being equal. Because 10 points correspond to 
one letter grade, the coefficient 7.192 for Calc HS is almost one letter grade. 

Using the coefficients of this regression equation, you can forecast what 
a particular student’s calculus score may be, given background information 
on the student. For example, consider a male student who did not take cal¬ 
culus in high school, scored 30 on his ACT Math exam, scored 23 on his 
algebra placement test, had a 4.0 grade in second-year high school algebra, 
and was ranked in the 90th percentile in his high school graduation class. 
You would predict that his calculus score would be 

Calc = 27.943 + 7.192(0) + 0.352(30) + 0.827(23) + 3.683(4.0) 

+ 0.111(90) + 2.627(1) = 84.87, or about 85 points 

Notice the Gender Code coefficient, 2.627, which shows the effect of 
gender if the other variables are held constant. Because the males are coded 1 
and the females are coded 0, if the regression model is true, a male student 
will score 2.627 points higher than a female student, even when the 
backgrounds of both students are equivalent (equivalent in terms of the predictor 
variables in the model). 
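
The prediction and the gender comparison above can be checked by evaluating the fitted equation directly. This sketch simply encodes the coefficients as printed in the prediction equation; the function name `predict` and the dictionary layout are illustrative choices, not anything produced by Excel. Summing the terms as printed for the example student gives about 84.87.

```python
# Estimated coefficients as printed in the prediction equation.
intercept = 27.943
coefficients = {
    "Calc HS": 7.192,
    "ACT Math": 0.352,
    "Alg Place": 0.827,
    "Alg2 Grade": 3.683,
    "HS Rank": 0.111,
    "Gender Code": 2.627,
}

def predict(student):
    """Predicted calculus score: intercept plus coefficient times value for each predictor."""
    return intercept + sum(coefficients[name] * student[name] for name in coefficients)

# The example student from the text: male, no high school calculus.
male = {"Calc HS": 0, "ACT Math": 30, "Alg Place": 23,
        "Alg2 Grade": 4.0, "HS Rank": 90, "Gender Code": 1}
# Same background, with Gender Code switched to 0 (female).
female = dict(male, **{"Gender Code": 0})

print(f"male:   {predict(male):.2f}")
print(f"female: {predict(female):.2f}")
print(f"difference: {predict(male) - predict(female):.3f}")
```

Holding every other predictor fixed, the two predictions differ by exactly the Gender Code coefficient, which is what "other variables held constant" means in practice.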

Whether you can trust that conclusion depends partly on whether the 
coefficient for Gender Code is significant. For that you have to determine 
the precision with which the value of the coefficient has been determined. 
You can do this by examining the estimated standard deviations of the coef¬ 
ficients, displayed in the Standard Error column. 


t Tests for the Coefficients 

The t Stat column shows the ratio between the coefficient and the standard 
error. If the population coefficient is 0, then this has the t distribution with 
degrees of freedom $n - p - 1 = 80 - 6 - 1 = 73$. Here n is the number of 
cases (80) and p is the number of predictors (6). The next column, P value, is 
the corresponding p value: the probability of a t value this large or larger in 
absolute value. For example, the t value for Alg Place is 3.092, so the 
probability of a t this large or larger in absolute value is about .003. The 
coefficient is significant at the 5% level because this is less than .05. In terms of 
hypothesis testing, you would reject the null hypothesis that the coefficient 
is 0 at the 5% level and accept the alternative hypothesis. This is a two-tailed 
test (it rejects the null hypothesis for either large positive or large negative 
values of t), so your alternative hypothesis is that the coefficient is not zero. 
Notice that only the coefficients for Alg Place and Calc HS are significant. 
This suggests that you not devote a lot of effort to interpreting the others. In 
particular, it would not be appropriate to assume from the regression that 
male students perform better than equally qualified female students. 

The range F17:G23 indicates the 95% confidence intervals for each of the 
coefficients. You are 95% confident that having calculus in high school is 
associated with an increase in the calculus score of at least 2.233 points and 
not more than 12.151 points in this particular regression equation. 
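
The t statistic, the standard error, and the confidence interval are tied together, so any one can be roughly recovered from the others. The sketch below works backward from the 95% interval quoted for Calc HS. It assumes a two-tailed 5% critical value of about 1.993 for 73 degrees of freedom; that number is an approximation supplied here for illustration and is not taken from the text, so the recovered standard error and t ratio are approximate as well.

```python
n, p = 80, 6
df = n - p - 1                 # residual degrees of freedom

coef = 7.192                   # Calc HS coefficient
lower, upper = 2.233, 12.151   # 95% confidence limits from the text

t_crit = 1.993                 # ASSUMED approximate t critical value for 73 df

midpoint = (lower + upper) / 2     # should equal the coefficient
half_width = (upper - lower) / 2   # equals t_crit * standard error
std_err = half_width / t_crit      # approximate standard error
t_stat = coef / std_err            # approximate t Stat for Calc HS

print(f"df = {df}, midpoint = {midpoint:.3f}")
print(f"approx. standard error = {std_err:.3f}, approx. t = {t_stat:.2f}")
```

Because the interval excludes zero, the recovered t ratio exceeds the critical value, which matches the statement that the Calc HS coefficient is significant at the 5% level.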

Is it strange that the ACT math score is nowhere near significant here, 
even though this test is supposed to be a strong indication of 
mathematics achievement? Looking back at the correlation matrix in Chapter 8, you 
can see that it has correlation 0.353 with Calc, which is highly significant 
(p = .001). Why is it not significant here? The answer involves other 
variables that contain some of the same information. In using the t distribution 
to test the significance of the ACT Math term, you are testing whether you 
can get away with deleting this term. If the other predictors can take up the 
slack and provide most of its information, then the test says that this term 
is not significant and therefore is not needed in the model. If each of the 
predictors can be predicted from the others, any single predictor can be 
eliminated without losing much. 

You might think that you could just drop from the model all the terms 
that are not significant. However, it is important to bear in mind that the 
individual tests are correlated, so each of them changes when you drop one of 
the terms. If you drop the least-significant term, others might then become 
significant. A frequently used strategy for reducing the number of predictors 
involves the following steps: 

1. Eliminate the least-significant predictor if it is not significant. 

2. Refit the model. 

3. Repeat Steps 1 and 2 until all predictors are significant. 
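
The three steps above amount to a simple loop, sketched here in Python. This is an outline under assumptions, not a procedure from the Analysis ToolPak: `backward_eliminate` is a hypothetical helper, and `fit_p_values` stands for whatever routine refits the regression on the remaining predictors and returns their p values. The toy fitting function and its p values are made up purely to show how a term can become significant after another is dropped.

```python
def backward_eliminate(predictors, fit_p_values, alpha=0.05):
    """Drop the least-significant predictor and refit until all are significant.

    fit_p_values(names) must refit the model using only `names` and return
    a dict mapping each name to its p value.
    """
    remaining = list(predictors)
    while remaining:
        p_values = fit_p_values(remaining)                        # refit the model
        worst = max(remaining, key=lambda name: p_values[name])   # least significant
        if p_values[worst] <= alpha:
            break                                                 # all remaining terms significant
        remaining.remove(worst)                                   # eliminate and repeat
    return remaining

def toy_fit(names):
    """Made-up p values: B only becomes significant once C is removed."""
    p = {"A": 0.01, "B": 0.30, "C": 0.60}
    if "C" not in names:
        p["B"] = 0.04
    return {name: p[name] for name in names}

kept = backward_eliminate(["A", "B", "C"], toy_fit)
print(kept)
```

Only one predictor is removed per pass because, as noted above, the individual tests are correlated and every p value can change after a refit.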

In the exercises, you’ll get a chance to rerun this model and eliminate all 
nonsignificant variables. For now, examine the model and see whether any 
assumptions have been violated. 


Testing Regression Assumptions 

There are a number of useful ways to look at the results produced by 
multiple linear regression. This section reviews the four common plots that can 
help you assess the success of the regression. 

1. Plotting dependent variables against the predicted values shows how 
well the regression fits the data. 

2. Plotting residuals against the predicted values magnifies the vertical 
spread of the data so you can assess whether the regression assumptions 
are justified. A curved pattern to the residuals indicates that the model 
does not fit the data. If the vertical spread is wider on one side of the 
plot, it suggests that the variance is not constant. 

3. Plotting residuals against individual predictor variables can sometimes 
reveal problems that are not clear from a plot of the residuals versus the 
predicted values. 

4. Creating a normal plot of the residuals helps you assess whether the 
regression assumption of normality is justified. 


Observed versus Predicted Values 

How successful is the regression? To see how well the regression fits the
data, plot the actual Calculus values against the predicted values stored in
B29:B109. (You can scroll down to view the residual output.) To plot the
observed calculus scores versus the predicted scores, you must first place the
data on the same worksheet.

To copy the observed scores:

1 Select the range B29:B109 and click the Copy button from the
Clipboard group on the Home tab.

2 Click the Calculus Data sheet tab.

3 Select the range H1:H81; right-click the selection and click Insert
Copied Cells from the popup menu to paste the predicted values
into column H.

4 Click the Shift Cells Right option button to move the observed
calculus scores into column I; then click OK.

The predicted calculus scores appear in column H, as shown in 
Figure 9-6 (formatted to show the column labels). 


Figure 9-6 
Predicted 
and observed 
calculus 
scores 



Now create a scatter plot of the data in the range 


To create the scatter plot of the observed scores versus the predicted
scores:

1 Click Single Variable Charts from the StatPlus menu and then click
Fast Scatterplot.

2 Click the x-axis button and then click the Use Range References
option button and select the range H1:H81 from the worksheet. Click
the OK button.

3 Click the y-axis button and select Calc from the list of range names
and click the OK button.

4 Click the Chart Options button and enter Calculus Scores for the 
chart title, Predicted for the x axis title, and Observed for the y axis 
title. Click the OK button. 

5 Click the Output button and save the chart to the Observed vs. 
Predicted chart sheet. Click OK. 

6 Click the OK button to generate the scatter plot. 

7 Rescale the x axis and y axis in the plot so that the ranges go from 40 
to 100 rather than from 0 to 100 or 0 to 120. 

The final form of the scatter plot should look like Figure 9-7. 


Figure 9-7 
Scatter plot 
of observed 
and predicted 
scores 



How good is the prediction shown here? Is there a narrow range of
observed values for a given predicted value? This plot is a slight improvement
on the plot of Calc versus Alg Place from the scatter plot matrix in Chapter 8. 
Figure 9-7 should be better because Alg Place and five other predictors are 
being used here. 

Does it appear that the range of values is narrower for large values of
predicted calculus score? If the error variance were lower for students with high
predicted values, it would be a violation of the third regression assumption,
which requires a constant error variance. Consider the students predicted to
have a grade of 80 in calculus. These students have actual grades of around
65 to around 95, a wide range. Notice that the variation is lower for students
predicted to have a grade of 90. Their actual scores are all in the 80s and 90s.
There is a barrier at the top (no score can be above 100), and this limits the
possible range. In general, when a barrier limits the range of the dependent
variable, it can cause nonconstant error variance. This issue is considered
further in the next section.

Plotting Residuals versus Predicted Values 

The plot of the residuals versus the predicted values shows another view 
of the variation in Figure 9-7 because the residuals are the differences
between the actual calculus scores and the predicted values.

To make the plot: 

1 Click the Multiple Regression sheet tab to return to the regression 
output. 

2 Create a scatter plot of the Residuals in the cell range C29:C109
versus Predicted Values in the cell range B29:B109 using either the
StatPlus Fast Scatterplot command or Excel’s built-in commands
to create a scatter plot.

3 Specify a chart title of Residual Plot, and label the x axis Predicted 
Calculus Scores and the y axis Residuals. Save the scatter plot to a 
chart sheet named Residuals vs. Predicted. 

4 Change the scale of the x axis from 0-100 to 60-100. Your chart 
sheet should look like Figure 9-8. 

Figure 9-8
Scatter plot 
of residuals 
and predicted 
scores 



This plot is useful for verifying the regression assumptions. For example,
the first assumption requires that the form of the model be correct. A
violation of this assumption might be seen in a curved pattern. No curve is
apparent here.

If the assumption of constant variance is not satisfied, then it should be
apparent in Figure 9-8. Look for a trend in the vertical spread of the data.
For example, the data may widen out as the predicted value increases.
There appears to be a definite trend toward a narrower spread on the right,
and it is cause for concern about the validity of the regression, although
regression does have some robustness with respect to the assumption of
constant variance.

For data that range from 0 to 100 (such as percentages), the
arcsine-square-root transformation sometimes helps fix problems with nonconstant
variance. The transformation involves creating a new column of
transformed calculus scores where

Transformed calc score = $\sin^{-1}\sqrt{\text{calculus score}/100}$

Using Excel, you would enter the formula

=ASIN(SQRT(x/100))

where x is the value or cell reference of a value you want to transform.
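
For readers who want to check the transformation outside of Excel, here is a minimal Python sketch of the same calculation. The function name and the sample scores are hypothetical; the result is in radians, as it is with Excel's ASIN function.

```python
import math

def arcsine_sqrt(score):
    """Arcsine-square-root transform of a score on a 0-100 scale.

    Mirrors the Excel formula =ASIN(SQRT(x/100)); the result is in radians.
    """
    return math.asin(math.sqrt(score / 100))

scores = [0, 25, 50, 100]  # hypothetical calculus scores
transformed = [arcsine_sqrt(s) for s in scores]
print(transformed)
```

The transformed values run from 0 (for a score of 0) to pi/2 (for a score of 100), and the scale is stretched near both ends of the range, which is where a ceiling or floor on the scores squeezes the variance.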

If you were to apply this transformation here and use the transformed
calculus score in the regression in place of the untransformed score, you would
find that it helps to make the variance more constant, but the regression
results are about the same. Calc HS and Alg Place are still the only significant
coefficients, and the R² value is almost the same as before. Of course, it is
much harder to interpret the coefficients after transformation. Who would
understand if you said that each point in the algebra placement score is worth
0.012 point in the arcsine of the square root of the calculus score divided by
100? From this perspective, the transformed regression is useful mainly to
validate the original regression. If it is valid and it gives essentially the same
results as the original regression, then the original results are valid.


Plotting Residuals versus Predictor Variables 

It is also useful to look at the plot of the residuals against each of the
predictor variables because a curve might show up on only one of those plots or
there might be an indication of nonconstant variance. Such plots are created 
automatically with the Analysis ToolPak Add-Ins. 

To view one of these plots: 

1 Click the Multiple Regression sheet tab to return to the regression
output. 


The plots generated by the add-in start in cell J1 and extend to cell Z32.
Two types of plots are generated: scatter plots of the regression residuals
versus each of the regression variables, and the observed and predicted values of
the response variable (calculus score) against each of the regression variables.
See Figure 9-9. (You might have to scroll up and right to see the charts.)


Figure 9-9 
Plots 
created 
with the 
Regression 
command 

The plots are shown in a cascading format, in which the plot title is often
the only visible element of a chart. When you click a chart title, the chart
goes to the front of the stack. The charts are small and hard to read,
however. You can better view each chart by placing it on a chart sheet of its own.
Try doing this with the plot of the residuals versus Alg Place.

To view the chart: 

1 Click the chart Alg Place Residual Plot (located in the range L5:Q14). 

2 Click the Move Chart button from the Location group on the Design
tab of the Chart Tools ribbon.

3 Click the As new sheet option button and type Alg Place Residual
Plot in the accompanying text box. Click OK.

The scatter plot is moved to a chart sheet shown in Figure 9-10.


Figure 9-10 
Alg Place 
residual plot 

Does the spread of the residual values appear constant for differing
values of the algebra placement score? It appears that the spread of the
residuals is wider for lower values of Alg Place. This might indicate that you
have to transform the data, perhaps using the arcsine transformation just
discussed.


Normal Errors and the Normal Plot


What about the assumption of normal errors? Usually, if there is a problem
with nonnormal errors, extreme values show up in the plot of residuals
versus predicted values. In this example there are no residual values beyond 25
in absolute value, as shown in Figure 9-8.

How large should the residuals be if the errors are normal? You can
decide whether these values are reasonable with a normal probability plot.


To make a normal plot of the residuals:

1 Return to the Multiple Regression worksheet.

2 Click Single Variable Charts from the StatPlus menu and then click
Normal P-plots.

3 Click the Data Values button, click the Use Range References option
button, and select the range C29:C109. Click OK.

4 Click the Output button and specify the new chart sheet Residual
P-plot as the output destination. Click OK.

5 Click OK to start generating the plot. See Figure 9-11.


Figure 9-11 
Normal 
P-plot of 
the residuals 



The plot is quite well behaved. It is fairly straight, and there are no extreme 
values (either in the upper right or lower left corners) at either end. It appears 
there is no problem with the normality assumption. 
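
To see what a normal probability plot compares, the following Python sketch pairs each residual with the normal score expected for its rank. It is a sketch under stated assumptions: the plotting position (rank - 0.375)/(n + 0.25) is one common convention and is not necessarily the one StatPlus uses, and the function name and residual values are hypothetical.

```python
from statistics import NormalDist

def normal_scores(values):
    """Return the expected standard normal score for each value's rank."""
    n = len(values)
    # Indices of the values, ordered from smallest to largest value.
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        p = (rank - 0.375) / (n + 0.25)   # assumed plotting position
        scores[idx] = NormalDist().inv_cdf(p)
    return scores

residuals = [-12.0, 3.5, 0.0, 8.0, -4.5]   # hypothetical residuals
scores = normal_scores(residuals)
for r, z in zip(residuals, scores):
    print(r, round(z, 3))
```

Plotting the residuals against these scores gives the normal plot. If the errors are normal, the points fall close to a straight line, which is the pattern described above for Figure 9-11.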


Summary of Calc Analysis


What main conclusions can you make about the calculus data, now that you
have done a regression, examined the regression residual file, and plotted
some of the data? With an R² of 0.37 and an adjusted R² of 0.320, the
regression accounts for only about one-third of the variance of the calculus score.
This is disappointing, considering all the weight that college scholarships,
admissions, placement, and athletics place on the predictors. Only the
algebra placement score and whether calculus was taken in high school have
significant coefficients in the regression. There is a slight problem with the
assumption of a constant variance, but that does not affect these
conclusions. You can close your workbook now, saving your changes.


Regression Example: Sex Discrimination 

In this next example, you use regression analysis to determine whether a
particular group is being discriminated against. For example, some of the
female faculty at a junior college felt underpaid, and they sought
statistical help in proving their case. The college collected data for the variables
that influence salary for 37 females and 44 males. The data are stored in the
Discrimination workbook.

To open the file:

1 Open Discrimination from the Chapter09 data folder.

2 Save the workbook as Discrimination Multiple Regression.


Table 9-1 shows the variables in the workbook.

Table 9-1 The Discrim Workbook

Range Name   Range    Description
Gender       A2:A82   Gender of faculty member (F = female, M = male)
MS_Hired     B2:B82   1 for Master’s degree when hired, 0 for no Master’s
                      degree when hired
Degree       C2:C82   Current degree: 1 for Bachelor’s, 2 for Master’s,
                      3 for Master’s plus 30 hours, and 4 for PhD
Age_Hired    D2:D82   Age when hired
Years        E2:E82   Number of years the faculty member has been
                      employed at the college
Salary       F2:F82   Current salary of faculty member

In this example, you use salary as the dependent variable, using four
other quantitative variables as predictors. One way to see whether female
faculty have been treated unfairly is to do the regression using just the male
data and then apply the regression to the female data. For each female
faculty member, this predicts what a male faculty member would make with
the same years, age when hired, degree, and Master’s degree status. The
residuals are interesting because they are the difference between what each
woman makes and her predicted salary if she were a man. This assumes that
all of the relevant predictors are being used, but it would be the college’s
responsibility to point out all the variables that influence salary in an
important way. When there is a union contract, which is the case here, it should
be clear which factors influence salary.


Regression on Male Faculty 

To do the regression on just the male faculty and then look at the residuals 
for the females, use Excel’s AutoFilter capability and copy the male rows to 
a new worksheet. 

To create a worksheet of salary information for male faculty only: 

1 Right-click the Salary Data sheet tab to open the pop-up menu and 
then click Insert from the menu. 

2 Click Worksheet from the General sheet of the Insert dialog box and 
click OK. 

3 Double-click the new sheet tab and type Male Faculty. Return to the 
Salary Data worksheet. 

4 Click the Filter button from the Sort & Filter group on the Data 
tab. Excel adds drop-down arrows to all of the column headers in 
the list. 

5 Click the Gender drop-down arrow; then deselect all of the
checkboxes except for M and click the OK button. Excel displays only the
data for the male faculty.

6 Select the range A1:F82; then click the Copy button from the
Clipboard group on the Home tab.

7 Go to cell A1 on the Male Faculty worksheet and click the Paste
button from the Clipboard group on the Home tab. The salary data
for male faculty now occupy the range A1:F45 on the Male Faculty
worksheet.
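
The filter-and-copy steps above amount to keeping only the rows whose Gender value is M. As a rough sketch of the same idea outside of Excel, the following Python example filters a small list of rows. The rows shown are hypothetical stand-ins for the workbook data, and the function name is illustrative only.

```python
# Hypothetical rows; the real workbook lists 81 faculty members.
salary_data = [
    {"Gender": "F", "Years": 4, "Salary": 27000},
    {"Gender": "M", "Years": 10, "Salary": 33000},
    {"Gender": "M", "Years": 2, "Salary": 25500},
    {"Gender": "F", "Years": 15, "Salary": 34000},
]

def filter_rows(rows, column, value):
    """Keep only the rows whose entry in `column` equals `value`."""
    return [row for row in rows if row[column] == value]

male_faculty = filter_rows(salary_data, "Gender", "M")
female_faculty = filter_rows(salary_data, "Gender", "F")
print(len(male_faculty), len(female_faculty))
```

The original list is left unchanged, just as the Salary Data worksheet still holds every row after the filtered copy is pasted elsewhere.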


Using a SPLOM to See Relationships


To get a sense of the relationships among the variables for the male faculty,
it is a good idea to compute a correlation matrix and plot the corresponding
scatter plot matrix.

To create the SPLOM: 

1 Click Multi-variable Charts from the StatPlus menu and then click
Scatterplot Matrix.

2 Click the Data Values button, click the Use Range References
option button, and select the range B1:F45. Click OK.

3 Click the Output button, click the New Worksheet option button, 
and type Male SPLOM in the accompanying text box. Click OK. 

4 Click OK to start generating the scatter plot matrix. See Figure 9-12 
for the completed SPLOM. 


Figure 9-12 
SPLOM of 
variables 
for male 
faculty 
salary 
data 

Focus on the last row because it shows the relationships of the other
variables to salary. Years employed is a good predictor because the range 
of salary is fairly narrow for each value of years employed (although the 
relationship is not perfectly linear). Age at which the employee was hired is 
not a very good predictor because there is a wide range of salary values for 
each value of age hired. There is not a significant relationship between the 
two predictors years employed and age hired. What about the other two 
predictors? Looking at the plots of salary against degree and MS hired 
makes it clear that neither of them is closely related to salary. The people
with higher degrees do not seem to be making higher salaries. Those
with a Master’s degree when hired do not seem to be making much more 
either. Therefore, the correlations of degree and MS hired with salary 
should be low. 

You might have some misgivings about using Degree as a predictor. 
After all, it is only an ordinal variable. There is a natural order to the four 
levels, but it is arbitrary to assign the values 1, 2, 3, and 4. This says that 
the spacing from Bachelor’s to Master’s (1 to 2) is the same as the spacing 
from Master’s plus 30 hours to PhD (3 to 4). You could instead assign the 
values 1, 2, 3, and 5, which would mean greater space from Master’s plus 
30 hours to the PhD. In spite of this arbitrary assignment, ordinal variables 
are frequently used as regression predictors. Usually, it does not make a 
significant difference whether the numbers are 1, 2, 3, and 4 or 1, 2, 3, 
and 5. In the present situation, you can see from Figure 9-12 that salaries 
are about the same in all four degree categories, which implies that the 
correlation of salary and degree is close to 0. This is true no matter what 
spacing is used. 


Correlation Matrix of Variables 


The SPLOM shows the relationships between salary and the other
variables. To quantify this relationship, create a correlation matrix of the
variables.


To form the correlation matrix: 

1 Click the Male Faculty sheet tab. 

2 Click Multivariate Analysis from the StatPlus menu and then click
Correlation Matrix.

3 Click the Data Values button, click the Use Range References option
button, select the range B1:F45, and click OK.

4 Click the Output button, click the New Sheet option button, and
type Male Corr Matrix in the New Sheet text box; then click OK
twice. The resulting correlation matrix appears on its own sheet, as
shown in Figure 9-13.


Figure 9-13 
Correlation 
matrix for 
male faculty 
salary data 

You might wonder why the variable Age Hired is used instead of
employee age. The problem with using employee age is one of collinearity.
Collinearity means that two or more of the predictor variables are highly
correlated with each other. In this case, the age of the employee is highly
correlated with the number of years employed because there is some
overlap between the two. (People who have been employed more years are likely
to be older.) This means that the information those two variables provide
is somewhat redundant. However, you can tell from Figure 9-13 that the
relationship between years employed and age when hired is negligible
because the p value is .681 (cell E13). Using the variable age hired instead of
age gives the advantage of having two nearly uncorrelated predictors in the
model. When predictors are only weakly correlated, it is much easier to
interpret the results of a multiple regression.
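
The collinearity argument can be illustrated with a small calculation. In the Python sketch below, the six faculty records are hypothetical (they are not taken from the Discrimination workbook), and current age is computed as age when hired plus years employed. The sketch shows why current age is almost bound to be correlated with years employed, while age when hired need not be.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical faculty records.
years = [2, 5, 9, 14, 20, 26]
age_hired = [30, 26, 35, 28, 33, 27]
# Current age contains years employed as one of its components.
age = [h + y for h, y in zip(age_hired, years)]

r_hired = pearson(years, age_hired)
r_age = pearson(years, age)
print(round(r_hired, 3), round(r_age, 3))
```

With these made-up values, years employed is nearly uncorrelated with age when hired but strongly correlated with current age, because current age includes years employed as part of its total.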

The correlations for Salary show a strong relationship to the number of 
years employed and some relationship to age when hired, but there is little 
relationship to a person’s degree. This is in agreement with the SPLOM in 
Figure 9-12. 


Multiple Regression


What happens when you throw all four predictors into the regression pot?


To specify the model for the regression: 

1 Click the Male Faculty sheet tab.

2 Click the Data Analysis button from the Analysis group on the Data
tab, select Regression from the list of Analysis Tools, and click OK.
The Regression dialog box might contain the options you selected
for the previous regression.

3 Type F1:F45 in the Input Y Range text box, press Tab, and then type
B1:E45 in the Input X Range text box.

4 Verify that the Labels checkbox is selected and that the Confidence 
Level checkbox is selected and contains a value of 95. 

5 Click the New Worksheet Ply option button and type Male Faculty 
Regression in the corresponding text box (replace the current
contents if necessary).

6 Verify that the Residuals, Standardized Residuals, Residual Plots, 
and Line Fit Plots checkboxes are selected. 

7 Click OK. 

The first portion of the summary output is shown in Figure 9-14, 
with columns resized and values reformatted. 


Interpreting the Regression Output


The R² of 0.732 shows that the regression explains 73.2% of the variance in
salary. However, when this is adjusted for the number of predictors (four),
the adjusted R² is about 0.705 = 70.5%. The standard error is 3,168.434, so
salaries vary roughly plus or minus $3,000 from their predictions. The
overall F ratio is about 26.67, with a p value in cell F12 of 1.063 × 10⁻¹⁰, which
rules out the hypothesis that all four population coefficients are 0. Looking
at the coefficient values and their standard errors, you see that the
coefficients for the variables Degree and MS Hired have values that are not much
more than 1 times their standard errors. Their t statistics are much less than
2, and their p values are much more than .05; therefore, they are not
significant at the 5% level. On the other hand, Years employed and Age Hired
do have coefficients that are much larger than their standard errors, with t
values of 9.39 and 4.49, respectively. The corresponding p values are
significant at the 0.1% level.
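
The adjusted R² and the overall F ratio quoted above can both be recovered from R², the number of observations, and the number of predictors. The Python sketch below uses the standard formulas with the rounded values from the text (R² = 0.732, 44 male faculty, 4 predictors), so its results agree with the Excel output only approximately. The function names are hypothetical.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared for n observations and p predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

def overall_f(r2, n, p):
    """Overall F ratio: explained variance per predictor over
    unexplained variance per residual degree of freedom."""
    return (r2 / p) / ((1 - r2) / (n - p - 1))

adj = adjusted_r2(0.732, 44, 4)
f_ratio = overall_f(0.732, 44, 4)
print(round(adj, 3), round(f_ratio, 2))
```

The small gap between the computed F ratio and the 26.67 reported by Excel comes from rounding R² to three decimal places before the calculation.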

The coefficient estimate of 606 for years employed indicates that each 
year on the job is worth $606 in annual salary if the other predictors are 
held fixed. Correspondingly, because the coefficient for Age Hired is about 
$374, all other factors being equal, an employee who was hired at an age 
1 year older than another employee will be paid an additional $374. 

Residual Analysis of Discrimination Data 

Now check the assumptions under which you performed the regression. 

To create a plot of residuals versus predicted salary values: 

1 Using either the StatPlus Fast Scatterplot command or Excel’s
Scatter button on the Insert tab, create a scatter plot of the
Residual values in the cell range C27:C71 versus Predicted
Values in the range B27:B71.

2 Enter a chart title of Residual Plot, and label the x axis Predicted 
Salaries and the y axis Residuals. Save the scatter plot to a chart 
sheet named Residuals vs. Predicted. 

3 Change the scale of the x axis from 0-45,000 to 20,000-45,000. Your 
chart sheet should look like Figure 9-15. 

There does not appear to be a problem with nonconstant variance. At
least, there is not a big change in the vertical spread of the residuals as you
move from left to right. However, there are two points that look
questionable. The one at the top has a residual value near 8,000 (indicating that this
individual is paid $8,000 more than predicted from the regression
equation), and at the bottom of the plot an individual is paid about $6,000 less
than predicted from the regression.

Except for these two, the points have a somewhat curved pattern (high
on the ends and low in the middle) of the kind that is sometimes helped by
a log transformation. As it turns out, the log transformation would straighten
out the plot, but the regression results would not change much. For example,
if log(salary) is used in place of salary, the R² value changes only from
0.732 to 0.733. When the results are unaffected by a transformation, it is
best not to bother because it is much easier to interpret the untransformed
regression.


Normal Plot of Residuals 

What about the normality assumption? Are the residuals reasonably in 
accord with what is expected for normal data? 

To create a normal probability plot of the residuals:

1 Click the Male Faculty Regression sheet tab.

2 Click Single Variable Charts from the StatPlus menu and then click
Normal P-plots.

3 Click the Data Values button, click the Use Range References option 
button, and select the range C27:C71. Click OK. 

4 Click the Output button and specify the new chart sheet Male 
Residual P-plot as the output destination. Click OK. 

5 Click OK to start generating the plot. See Figure 9-16. 



The plot is reasonably straight, although there is a point at the upper 
right that is a little farther to the right than expected. This point belongs to 
the employee whose salary is $8,000 more than predicted, but it does not 
appear to be too extreme. You can conclude that the residuals seem
consistent with the normality assumption.


Are Female Faculty Underpaid?


Being satisfied with the validity of the regression on males, let’s go ahead
and apply it to the females to see whether they are underpaid. The idea is
to look at the differences between what female faculty members were paid
and what we would predict they would be paid on the basis of the
regression model for male faculty. Your ultimate goal is to choose between two
hypotheses.

H0: The mean population salaries of females are equal to the salaries
predicted from the population model for males.

Ha: The mean population salaries of females are lower than the salaries
predicted from the population model for males.

To obtain statistics on the salaries for females relative to males, you must
create new columns of predicted values and residuals.

To create new columns of predicted values and residuals: 

1 Create a new blank worksheet named Female Faculty and then go to 
the Salary Data worksheet. 

2 Click the Gender drop-down list arrow and select only the F 
checkbox. 

Verify that the range A1:F38 displaying data on only the female
faculty is displayed in the worksheet.

3 Copy the selection and paste it to cell A1 on the Female Faculty
worksheet.

4 In the Female Faculty worksheet, click cell G1, type Predicted Salary,
press Tab, type Residuals, and then press Enter.

5 Select the range G2:H38.

6 In cell G2, type

=12900.67+744.4821*B2-783.529*C2+373.7354*D2+606.1759*E2

(the regression equation for males), and press Tab.

7 Type =F2-G2 in cell H2, and press Enter.

8 Select the cell range G2:H38 and then click the Fill button from the
Editing group on the Home tab and click Down. Excel inserts the
formula into the remaining cells in the two columns. The data should
appear as in Figure 9-17.
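
The two formulas entered in columns G and H can be written as a short function. The Python sketch below uses the coefficients of the regression equation for males; the faculty member shown (her salary, degree, age when hired, and years employed) is hypothetical and serves only to show how a predicted salary and a residual are obtained.

```python
def predicted_male_salary(ms_hired, degree, age_hired, years):
    """Salary predicted by the regression fitted to the male faculty."""
    return (12900.67
            + 744.4821 * ms_hired
            - 783.529 * degree
            + 373.7354 * age_hired
            + 606.1759 * years)

# Hypothetical female faculty member (not a row from the workbook).
salary = 28000
predicted = predicted_male_salary(ms_hired=1, degree=2, age_hired=30, years=10)
residual = salary - predicted   # same as =F2-G2 in the worksheet
print(round(predicted, 2), round(residual, 2))
```

A negative residual means that she earns less than the model predicts for a male faculty member with the same values of the four predictors.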

To see whether females are paid about the same salary that would be
predicted if they were males, create a scatter plot of residuals versus predicted
salary.

To create the scatter plot:

1 Using either the StatPlus Fast Scatterplot command or the
Scatterplot button on the Insert tab, create a scatterplot of the Residuals
(H1:H38) versus Predicted Salary (G1:G38).

2 Give a chart title of Female Residual Plot, and label the x axis
Predicted Salaries and the y axis Residuals. Save the chart to a chart
sheet named Female Residual Plot.

3 Change the scale of the x axis from 0-45,000 to 20,000-40,000. Your
chart sheet should resemble that shown in Figure 9-18.

Figure 9-18
Scatter plot 
of residuals 
versus 
predicted 
values for 
female 
faculty 



Out of 37 female faculty, only 5 have salaries greater than what would be
predicted if they were males, whereas 32 have salaries less than predicted.
Calculate the descriptive statistics for the residuals to determine the average 
discrepancy in salary. 

To calculate descriptive statistics for female faculty’s salaries: 

1 Click the Female Faculty worksheet tab.

2 Click Descriptive Statistics from the StatPlus menu and then click
Univariate Statistics.

3 Click the All summary statistics and All variability statistics
checkboxes.

4 Click the Input button, click the Use Range References option button,
and select the range H1:H38. Click OK.

5 Click the Output button, and select the new worksheet Female 
Residual Stats as the output destination. Click OK. 

6 Click OK to generate the table of descriptive statistics shown in 
Figure 9-19. 

Figure 9-19
Descriptive 
statistics of 
the residuals 
for female 
faculty 



On the basis of the descriptive statistics, you can conclude that the female 
faculty are paid, on average, $3,063.64 less than what would be expected 
for equally qualified male faculty members (as quantified by the predictor 
variables). The largest discrepancy is for a female faculty member who is 
paid $8,825 less than expected (cell B9). Of those with salaries greater than 
predicted, there is a female faculty member who is paid $2,090 more than 
expected (cell B10).

To understand the salary deficit better, you can plot residuals against the 
relevant predictor variables. Start by plotting the female salary residuals 
versus age when hired. (You could plot residuals versus years employed, 
but you would see no particular trend in the pattern of the residuals.) 

To plot the residuals against Age Hired: 

1 Click the Female Faculty sheet tab to return to the data worksheet. 

2 Using either the StatPlus Fast Scatterplot command or the Scatter
chart button on the Insert tab of the Excel ribbon, create a scatter
plot of Residuals (H1:H38) versus Age Hired (D1:D38).

3 Give a chart title of Residuals vs. Age Hired, and label the x axis
Age Hired and the y axis Residuals. Save the chart in a new chart
sheet named Female Resid vs. Age Hired.

4 Change the scale of the x axis from 0-60 to 20-50. Your chart should
look like Figure 9-20.


Figure 9-20
Scatter plot
of residuals
versus Age
Hired for
female
faculty



There seems to be a downward trend to the scatterplot, indicating that the 
greater discrepancies in salaries occur for older female faculty. Add a linear 
regression line to the plot, regressing residuals versus age when hired. 

To add a linear regression line to the plot: 

1 Right-click the data series (any one of the data points in Figure 9-20), 
and click Add Trendline in the shortcut menu. 

2 Verify that the Linear Trend/Regression Type option is selected, and 
then click Close. Your plot should now look like Figure 9-21. 
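
The trend line that Excel adds in the steps above is an ordinary least-squares line. As a rough sketch of what it computes, the Python example below fits a line to a handful of hypothetical residuals and ages when hired. The data values and the function name are illustrative only and are not taken from the workbook.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for y regressed on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept

# Hypothetical values for six female faculty members.
age_hired = [24, 28, 33, 37, 42, 46]
residuals = [-500, -1500, -2500, -3500, -5000, -6000]

slope, intercept = fit_line(age_hired, residuals)
print(round(slope, 2), round(intercept, 2))
```

A negative slope means that the predicted salary shortfall grows with each additional year of age at hiring, which is the pattern the trend line makes visible.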

Figure 9-21
Scatter plot 
with trend 
line added 



This plot shows a salary deficiency that depends very much on the age at
which a female was hired. Those who were hired under the age of 25 have
residuals that average around 0 or a little below. Those who were hired over
the age of 40 are underpaid by more than $5,000 on average. The most
underpaid female has a deficit of nearly $9,000.


Drawing Conclusions 

Why should age make a difference in the discrepancies? One possibility is
that women are more likely than men to take time off from their careers to
raise their children. If this is the case, an older male faculty member would
have more job experience and thus be paid more. However, this might not
be true of all women, yet all of the females who were hired over the age of
36 were underpaid.

To summarize, the female faculty are underpaid an average of about 
$3,000. However, there is a big difference depending on how old they were 
when hired. Those who were hired after the age of 40 have an average deficit 
of more than $5,000. It should be noted that when the case was eventually 
settled out of court, each woman received the same compensation,
regardless of age. You can now close the workbook, saving your changes.


Exercises


1. Use Excel’s FINV function to calculate
the critical value for the following
F distributions (assume that the
p value = .05):

a. Numerator degrees of freedom = 1;
denominator degrees of freedom = 9.

b. Numerator degrees of freedom = 2;
denominator degrees of freedom = 9.

c. Numerator degrees of freedom = 3;
denominator degrees of freedom = 9.

d. Numerator degrees of freedom = 4;
denominator degrees of freedom = 9.

e. Numerator degrees of freedom = 5;
denominator degrees of freedom = 9.

2. Use Excel’s FDIST function to calculate the
p value for the following F distributions
(assume that the critical value = 3.5):

a. Numerator degrees of freedom = 1;
denominator degrees of freedom = 9.

b. Numerator degrees of freedom = 2;
denominator degrees of freedom = 9.

c. Numerator degrees of freedom = 3;
denominator degrees of freedom = 9.

d. Numerator degrees of freedom = 4;
denominator degrees of freedom = 9.

e. Numerator degrees of freedom = 5;
denominator degrees of freedom = 9.

3. Which of the following models can be
solved using linear regression? Justify
your answers.

a. $y = \beta_0 + \beta_1 x + \beta_2 x^2 + \varepsilon$


c. $y = \beta_0 + \beta_1 \sin x + \beta_2 \cos x + \varepsilon$

4. What is collinearity? 

5. The Wheat workbook contains nutritional
data on ten different wheat products.
You’ve been asked to determine the
relationship between calories, carbohydrates,
protein, and fat.

a. Open the Wheat workbook from the
Chapter09 folder and save it as Wheat
Multiple Regression.

b. Generate the correlation matrix for
the variables Calories, Carbo-Fiber
(the carbohydrate value minus the
fiber value), Protein, and Fat. Also
create the corresponding scatterplot
matrix.

c. Regress Calories on the other three
variables and obtain the residual output.
How successful is the regression?
It is known that carbohydrates (once
adjusted for fiber content) have 4 calories
per gram, protein has 4 calories per
gram, and fats have 9 calories per gram.
How do the coefficients compare with
the known values?

d. Explain why the coefficient for fat is
inaccurate, in terms of its standard
error and in comparison with the
known value of 9. (Hint: Examine the
data and notice that the fat content is
specified with the least precision.)

e. Plot the residuals against the predicted
values. Is there an outlier?
Label the points of the scatter plot
by food brand to see which case is
most extreme. Do the calories add
up correctly for this case? That is,
when you multiply the carbohydrate
content (adjusted for fiber content)
by 4, the protein content by 4, and
the fat content by 9, does it add up
to more calories than are stated on
the package? Notice also that another
case has close to the same values
of Carbo-Fiber, Protein, and Fat,
but the Calories value is 10 higher.
How do you explain this? Would
a company understate the calorie
content?

f. Save your changes to the workbook 
and write a report summarizing your 
observations. 

6. The Fritos workbook is a slight modification
of the Wheat workbook with
data added about Fritos corn chips. It is
included because it has a substantial fat
content, in contrast to the other foods in
the data set. Because none of the foods
there have much fat, it is impossible to
see from the Wheat workbook how much
fat contributes to the calories in the foods.

a. Open the Fritos workbook from the
Chapter09 folder and save it as Fritos
Multiple Regression.

b. Repeat the regression of the previous
exercise and see whether the coefficient
for fat is now estimated more
accurately. Use both the known value
of 9 for comparison and the standard
error of the regression that is printed
in the output.

c. Save your changes to the workbook 
and write a report summarizing your 
observations. 

7. The Baseball workbook contains team
statistics for each of the major league
teams from the 2001-2007 baseball seasons.
You’ve been asked to derive an
equation that predicts the number of
runs per game on the basis of the number
of singles, doubles, triples, home
runs, bases on balls, and strikeouts.

a. Open the Baseball workbook from
the Chapter09 folder and save it as
Baseball Multiple Regression.

b. Create six new columns in the Baseball
Stats worksheet and calculate the
average number of singles, doubles,
triples, home runs, bases on balls, and
strikeouts per game for each of the
teams in the data set.

c. Regress Runs per Game on the six
variables you created to derive an
equation for the average number of
runs per game on the basis of the
average number of singles, doubles,
triples, home runs, bases on balls,
and strikeouts. Are all of the variables
in your equation significant?
Remove any insignificant variables
from your model and rerun the regression.
Compare your results with the
results obtained by Rosner and Woods
(1988), as quoted in the beginning of
this chapter. Can the differences be
explained in terms of the standard
errors of the coefficients?

d. Do the Rosner-Woods coefficients 
make more sense in terms of which 
should be largest and which should 
be smallest? 

e. Save your changes to the workbook 
and write a report summarizing your 
results. 

8. The Toyota workbook contains price, 
age, and mileage data for used car sales 
of Toyota Corollas from 2009. You’ve 
been asked to analyze the data to model 
the effect of age and mileage on the used 
car price. 

a. Open the Toyota workbook from the 
Chapter09 data folder and save it as 
Toyota Multiple Regression. 

b. Regress price on age and miles. What
impact do age and miles have on the
sale price of the car? Are both variables
significant in your regression
equation?

c. Create plots of Residuals versus
Miles, Residuals versus Age, and
Residuals versus Predicted Price.
Do you notice any pattern in
the graphs that would indicate a
problem with the constant variance
assumption or the linearity
assumption?

d. Create a normal plot of the residuals.
Does your plot support a conclusion
that the residuals are normally
distributed?

e. Save your changes to the workbook
and write a report summarizing your
observations.

9. The regression performed in the previous
exercise assumed that prices would
change linearly with miles and age. It
could also be the case that the prices
change instead as a percentage, so
that instead of dropping $1000 per year,
the price would drop 10% per year. You
can check this assumption by analyzing
the logarithm of the used car sales
price.

a. Open the Toyota workbook from the
Chapter09 folder and save it as Toyota
Log Regression.

b. Create a new variable named LogPrice
equal to the log(price) value.

c. Repeat the regression from the last
exercise using log(price) rather than
price.

d. Does this improve the multiple
correlation? Have the p values associated
with the miles and age coefficients
become more significant?

e. When log(price) is used as the
dependent variable, the regression
can be interpreted in terms of
percentage drop in Price per year of
Age, instead of a fixed drop per year
of Age when Price is used as the
dependent variable. Does it make more
sense to have the price drop by 16.5%
each year or to have the price drop by
$721 per year? In particular, would an
old car lose as much value per year as
it did when it was young?

f. Save your changes to the workbook 
and write a report summarizing your 
conclusions. 


10. The Cars workbook contains data based
on reviews published in Consumer
Reports®, 2003-2008. See Exercise 10
of Chapter 2. The workbook includes
observations from 275 car models on the
variables Price, MPG (miles per
gallon), Cyl (number of cylinders),
Eng size (engine displacement in liters),
Eng type (normal, hybrid, turbo,
turbodiesel), HP (horsepower), Weight (vehicle
weight in pounds), Time0-60 (time to
accelerate from 0 to 60 miles per hour in
seconds), Date (month of publication),
and Region (United States, Europe, or
Asia). There is an additional variable
Eng type01 that is 1 for hybrids and
diesels and 0 otherwise.

a. Open the Cars workbook from the
Chapter09 folder and save it as Cars
Multiple Regression.

b. Create a correlation matrix (excluding
Spearman’s rank correlation) and a
scatter plot matrix of the seven
quantitative variables Price, MPG, Cyl, Eng
size, HP, Weight, and Eng type01.

c. Regress MPG on Cyl, Eng size, HP,
Weight, Price, and Eng type01.

d. Note that the regression coefficients
for Cyl and Eng size are not significant
at the .05 level. Compare this to
the p values for these variables in the
correlation matrix. What accounts for
the lack of significance? (Hint: Look at
the correlations among Cyl, Eng size,
HP, Price, and Weight.)

e. Create a scatter plot of the regression
residuals versus the predicted values.
Judging by the scatter plot, do the
assumptions of the regression appear to
be violated? Why or why not?

f. Create a new variable, GPM100, that
displays 100 divided by the miles per
gallon. This measures the fuel necessary
to go 100 miles. Some statisticians
and car magazines use this
because it gives a direct measure
of the energy consumption. Redo
your regression model with this
new dependent variable in place of
MPG. How does the residual versus
predicted value plot compare to the
earlier one? Compare the regression
results with the previous results for
MPG (including R squared).

g. Save your changes to the workbook
and write a report summarizing your
conclusions.

11. Return to the Cars workbook and perform
the following analysis:

a. Open the Cars workbook from the
Chapter09 folder and save it as Cars
Reduced Model.

b. Recreate the GPM100 variable described
in the previous exercise and
then regress GPM100 on the same
numeric variables. Try to reduce the
number of variables in the model
using the following algorithm:

i. Perform the regression.

ii. If any coefficients in the regression
are nonsignificant, redo the
regression with the least significant
variable removed.

iii. Continue until all coefficients
remaining are significant.

To do this, you may have to move
the columns around because the
Regression command requires
that all predictor variables lie in
adjacent columns.

c. How does the $R^2$ value for this reduced
model compare to the full
model with six predictors?

d. Report your final model and save your
changes to the workbook.

12. Perform the following final analysis on
the Cars data:

a. Reopen the Cars workbook from the
Chapter09 folder and save it as Cars
Final Analysis.

b. Regress the variable GPM100
described in Exercise 11 on Cyl,
Eng size, HP, Weight, Price, and
Eng type01 for only the U.S. cars.
(You will have to copy the data to a
new worksheet using the AutoFilter
function.)

c. Analyze the residuals of the model.
Do they follow the assumptions
reasonably well?

d. In the Car Data worksheet, add a new
column containing the predicted
GPM100 values for all car models
using the regression equation you
created for only the U.S. cars. Create
another new column containing the
residuals.

e. Plot the residuals against the predicted
values for all of the cars, and then
break down the scatter plot into
categories on the basis of origin. Rescale
the x-axis so that it ranges from 3 to 8.

f. Calculate descriptive statistics (include
the summary, variability, and
95% t-confidence intervals) for the
residuals column, broken down by
region.

g. Save your changes to the workbook
and write a report summarizing your
conclusions, including a discussion
of whether Asian and European cars
appear to have better MPG after
correction for the other factors. Because
the model was developed for U.S.
cars, the average residual for U.S. cars
will be 0. If a car has a negative
residual for GPM100, then the car uses
less energy than was predicted for it,
and therefore it gets better gas mileage
than predicted (the value predicted for
a U.S. car).

13. The Temperatures workbook contains
average January temperatures for 56 cities
in the United States, along with the cities’
latitude and longitude. Perform the
following analysis of the data:

a. Open the Temperatures workbook
from the Chapter09 folder and save it
as Temperatures Regression.

b. Create a chart sheet containing a
scatterplot of latitude vs. longitude.
Modify the scales for the horizontal
and vertical axes to go from 60 to
120 degrees in longitude and from
20 to 50 degrees in latitude. Reverse
the orientation of the x-axis so that
it starts from 120 degrees on the left
and goes down to 60 degrees on the
right. Add labels to the points, showing
the temperature for each city.

c. Construct a regression model that relates
average temperature to latitude
and longitude.

d. Examine the results of the regression.
Are both predictor variables statistically
significant at the 5% level?
What is the $R^2$ value? How much of
the variability in temperature is explained
by longitude and latitude?

e. Format the regression values generated
by the Analysis ToolPak to display
residual values as integers. Copy
the map chart from part b to a new
chart sheet, and delete the temperature
labels. Now label the points using
the residual values.

f. Interpret your findings. Where are the
negative values clustered? Where do
you usually find positive residuals?

g. Write a report summarizing your
findings, discussing where the linear
model fails and why.

14. The Housing Price workbook contains
information on home prices in
Albuquerque, New Mexico.

a. Open the Housing Price workbook
from the Chapter09 folder and save it
as Housing Price Regression.

b. Regress the price of the houses in the
sample on three predictor variables:
Square Feet, Age, and number of
features.

c. Examine the plot of residuals versus
predicted values. Is there any violation
of the regression assumptions
evident in this plot?

d. Redo the regression analysis, this time
regressing the Log Price on the three
predictor variables. How does the plot
of residuals versus predicted values
appear in this model? Did the logarithm
correct the problem you noted
earlier?

e. There is an outlier in the plot. Identify
the point and describe what this
tells us about the price of the house if
the model is correct.

f. Save your changes to the workbook 
and write a report summarizing your 
observations. 

15. The Unemployment workbook contains
the U.S. unemployment rate, Federal
Reserve Board index of industrial
production, and year of the decade
1950-1959. Unemployment is the dependent
variable; Industrial Production
and Year of the Decade are the predictor
variables.

a. Open the Unemployment workbook
from the Chapter09 folder and save it
as Unemployment Regression.

b. Create a chart sheet showing the scatter
plot of Unemployment versus
FRB_Index. Add a linear trend line to
the chart. Does unemployment appear
to rise along with production?

c. Using the Analysis ToolPak, run a
simple linear regression of Unemployment
versus FRB_Index. What is the
regression equation? What is the $R^2$
value? Does the regression explain
much of the variability in unemployment
values during the 1950s?

d. Rerun the regression, adding Years to
the regression equation. How does the
$R^2$ value change with the addition of
the Years factor? What is the regression
equation?

e. Compare the parameter value for
FRB_Index in the first equation with
that in the second. How are they different?
Does your interpretation of the
effect of production on unemployment
change from one regression to
the other?

f. Calculate the correlation between
FRB_Index and Year. How significant
is the correlation?

g. Save your changes to the workbook
and write a report summarizing your
observations.

17. The Beer Rating workbook contains ratings
from ratebeer.com along with values
of IBU (international bittering units,
a measure of bitterness) and ABV
(alcohol by volume) for 25 beers.

a. Open the Beer Rating workbook from
the Chapter09 folder and save it as
Beer Rating Regression.

b. Create a correlation matrix and
scatterplot matrix for Rating, IBU,
and ABV.

c. Which beer has the highest rating?
The lowest?

d. Describe the relationships among the
three variables. Given how the variables
are related, do the correlations
fully describe the strengths of the
relationships? Explain.

e. Regress Rating on IBU and ABV.
Notice that, although both predictors
have strongly significant correlations
with Rating, they do not both have
significant regression coefficients.
How do you explain this?

f. Plot the residuals from (e) to check
the assumptions. Which of the
assumptions is clearly not satisfied?
Why should this be expected based
on (d)?

g. Repeat the multiple regression in (e)
with the square of IBU as a third
predictor. Check the assumptions again.

h. How effective is the regression in (g)?
Interpret the coefficients with regard
to statistical significance and sign. In
particular, discuss the coefficient for
the square of IBU.

i. Save your changes to the workbook
and then write a report summarizing
your conclusions.

Chapter 10


Analysis of Variance



In this chapter you will learn to:

• Compare several groups graphically

• Compare the means of several groups using analysis of variance

• Correct for multiple comparisons using the Bonferroni test

• Find which pairs of means differ significantly

• Compare analysis of variance to regression analysis

• Perform a two-way analysis of variance

• Create and interpret an interaction plot

• Check the validity of assumptions

One-Way Analysis of Variance

Earlier we used the t test to compare two treatment groups, such as two
groups taught by two different methods. What if there are four treatment
groups? We might have 40 subjects split into four groups, with each group
receiving a different treatment. The treatments might be four different drugs,
four different diets, or four different advertising videos. Analysis of variance,
or ANOVA, provides a test to determine whether to accept or reject the
hypothesis that all of the group means are equal.

The model we’ll use for analysis of variance, called a means model, is

$y = \mu_i + \varepsilon$

Here, $\mu_i$ is the mean of the ith group, and $\varepsilon$ is a random error
following a normal distribution with mean 0 and variance $\sigma^2$. If there
are P groups, the null and alternative hypotheses for the means model are

$H_0$: $\mu_1 = \mu_2 = \cdots = \mu_P$

$H_a$: Not all of the $\mu_i$ are equal.

Note that the assumptions for the means model are similar to those used
for regression analysis.

• The errors are normally distributed.

• The errors are independent.

• The errors have constant variance $\sigma^2$.

The similarity to regression is no accident. As you will see later in
this chapter, analysis of variance can be thought of as a special case of
regression.

To verify analysis of variance assumptions, it is helpful to make a plot that
shows the distribution of observations in each of the treatment groups. If the
plot shows large differences in the spread among the treatment groups, there
might be a problem of nonconstant variance. If the plot shows outliers, there
might be a problem with the normality assumption. Independence could
also be a problem if time is important in the data collection, in which case
consecutive observations might be correlated. However, there are usually no
problems with the independence assumption in the analysis of variance.
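
To get a feel for what the means model implies, it can help to generate data from it. The following Python sketch is only an illustration and is not part of the Excel or StatPlus workflow; the function name, the group means, and the error standard deviation are all hypothetical values chosen for the example.

```python
import random

def simulate_means_model(group_means, n_per_group, sigma, seed=0):
    """Draw y = mu_i + e for each group, with e ~ Normal(0, sigma^2)."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return {
        name: [mu + rng.gauss(0.0, sigma) for _ in range(n_per_group)]
        for name, mu in group_means.items()
    }

# Hypothetical group means and error standard deviation, for illustration only.
data = simulate_means_model({"A": 10.0, "B": 12.0, "C": 15.0},
                            n_per_group=8, sigma=2.0)

# Each sample average should land near its group mean, but not exactly on it.
for name, values in data.items():
    print(name, round(sum(values) / len(values), 2))
```

Because every group shares the same sigma, the simulated groups have the same spread, which is the constant variance assumption listed above.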


Analysis of Variance Example: Comparing 
Hotel Prices 

Some professional associations are reluctant to hold meetings in New York
City because of high hotel prices and taxes. Are hotels in New York City
more expensive than hotels in other major cities?

To answer this question, let’s look at hotel prices in four major cities: New
York City, Chicago, Denver, and San Francisco. For each city, a random sample
of eight hotels was taken from the TripAdvisor.com website (February
2008) and stored in the Hotel workbook. The workbook contains the following
variables as shown in Table 10-1:


Table 10-1 The Hotel Workbook

Range Name   Range    Description
City         A2:A33   City of each hotel
Hotel        B2:B33   Name of hotel
Stars        C2:C33   TripAdvisor.com rating (February 2008), on a scale from 1 to 5
Price        D2:D33   The room price


To open the Hotel workbook:

1 Open the Hotel workbook from the Chapter10 folder.

2 Save the workbook as Hotel ANOVA. The workbook appears as
shown in Figure 10-1.


Figure 10-1 The Hotel workbook

We have to decide between two hypotheses.

$H_0$: The mean hotel population price is the same for each city.
$H_a$: The mean hotel population prices are not the same.


Graphing the Data to Verify ANOVA Assumptions


It is best to begin with a graph that shows the distribution of hotel prices in
each of the four cities. To do this, you can use the multiple histograms command
available in StatPlus.


Figure 10-2 Create Multiple Histograms dialog box


To create the graphs: 

1 Click Multi-variable Charts from the StatPlus menu and then click
Multiple Histograms.

2 Because the workbook is laid out with the variable values in one 
column and the categories in another, verify that the Use a column 
of category levels option button is selected. 

3 Click the Data Values button, and select Price from the list of range 
names. Click OK. 

4 Click the Categories button, and select City from the range names 
list. Click OK. 

5 Click the Display normal curve checkbox, and verify that the 
Frequency option button is selected. 

6 Click the Output button, and send the output to a new worksheet 
named Histograms. Click OK. 

Your completed dialog box should appear as shown in Figure 10-2.
Click OK. StatPlus generates the histograms shown in Figure 10-3.


Figure 10-3 Multiple histograms of prices in each city



What do these histograms tell you about the analysis of variance
assumptions? One of the assumptions states that the population variance is
the same in each group. If one city has prices that are all bunched together
and another has a very wide spread of prices, unequal variances can be
a problem. The plot shows a tendency for the spread to be larger when
the prices are higher. In particular, New York seems to have the highest
mean price and the biggest spread, and Chicago is second in both mean
and spread. In this situation it sometimes helps to replace the response
variable with its logarithm. Exercise 7 asks you to replace Price
with the log of price and see what effect this has on the analysis. If
you find there that the assumptions seem valid but that the results are
essentially the same, then that tends to validate the original analysis. Here
we continue with the analysis using Price as the response variable, even
though there is a question about the equal variance assumption. Generally
speaking, it is easier to interpret the analysis on Price rather than its
logarithm or some other transform.

What about the assumption of normal data? The analysis of variance is
robust to the normality assumption, so only major departures from normality
would cause trouble. In any case, eight observations in each group may be
too few to determine whether the normality assumption is violated.


Computing the Analysis of Variance

From the histograms, it appears that New York has the highest mean hotel
price. Still, there is some overlap between the New York prices and the others.
Do you think that New York City is significantly more expensive than the other
cities? We’ll soon find out by performing an analysis of variance. To do so,
we’ll have to use the Analysis of Variance command available from the Analysis
ToolPak, the statistical add-in supplied with Excel. The Analysis ToolPak
requires that the group values be placed in separate columns. In this workbook,
groups are identified by a category variable, so you’ll have to unstack the price
values on the basis of the City variable, creating four separate price columns.

To unstack the Price column:

1 Click Manipulate Columns from the StatPlus menu and then click
Unstack Column.

2 Click the Data Values button, and select Price from the range names
list. Click OK.

3 Click the Categories button, select City from the range name list,
and click OK.

4 Deselect the Sort the Columns checkbox.

5 Click the Output button and send the unstacked values to a new
worksheet named Price Data. Click OK.

Figure 10-4 shows the completed Unstack Column dialog box.
Click OK.


Figure 10-4 The Unstack Column dialog box (callout: Price values will be
placed in separate columns, each column created based on a different
value of the City variable)

The unstacked data are shown in Figure 10-5.


Figure 10-5 The unstacked data



STATPLUS TIPS

• You can use StatPlus’s Stack Columns command to stack a series of
columns. The resulting data set will contain two columns: a
column of data values and a column containing the category
labels.
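
The relationship between the stacked layout (one column of values plus one column of category labels) and the unstacked layout (one column per category) can be sketched outside Excel. The Python below is a minimal illustration, not StatPlus’s implementation; the function names `unstack` and `stack` and the sample prices are hypothetical.

```python
from collections import defaultdict

def unstack(values, categories):
    """Split one column of values into one list per category, in first-seen order."""
    columns = defaultdict(list)
    for value, category in zip(values, categories):
        columns[category].append(value)
    return dict(columns)

def stack(columns):
    """Inverse of unstack: return (values, categories) as two parallel lists."""
    values, categories = [], []
    for category, column in columns.items():
        values.extend(column)
        categories.extend([category] * len(column))
    return values, categories

# Hypothetical prices, not the values stored in the Hotel workbook.
price = [300, 150, 120, 310, 160, 125]
city = ["NYC", "CHI", "DEN", "NYC", "CHI", "DEN"]

wide = unstack(price, city)   # one list of prices per city
print(wide)
print(stack(wide))            # back to a values column and a labels column
```

Keeping the categories in first-seen order, rather than sorting them, is one way to mirror the choice made above when the Sort the Columns checkbox is deselected.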


Now you can perform the analysis of variance on the price data.


To perform the analysis of variance:

1 Click Data Analysis from the Analysis group on the Data tab, then
click Anova: Single Factor in the Analysis Tools list box, and then
click OK.

2 Type A1:D9 in the Input Range text box, and verify that the Grouped
By Columns option button is selected.

3 Click the Labels in First Row checkbox to select it.

4 Click the New Worksheet Ply option button and type Price ANOVA
in the corresponding text box. Your dialog box should look like
Figure 10-6.


Figure 10-6 The Anova: Single Factor dialog box


5 Click OK.

Figure 10-7 shows the resulting analysis of variance output, with
some minor formatting.


Figure 10-7 Analysis of variance output


Interpreting the Analysis of Variance Table


In performing an analysis of variance, you determine what part of the
variance you should attribute to randomness and what part you can attribute
to other factors. Analysis of variance does this by splitting the total sum of
squares (the sum of squared deviations from the mean) into two parts: a part
attributed to differences between the groups and a part due to random error
or random chance. To see how this is done, recall that the formula for the
total sum of squares is

$$\text{Total SS} = \sum_{i=1}^{n} (y_i - \bar{y})^2$$

Here, the total number of observations is n, and the average of all
observations is $\bar{y}$. The value for the hotel data is 933,747.9 and is shown in cell
B16. The sample average (not shown) is 272.5625.

Let’s express the total SS in a different way. We’ll break the calculation
down by the various groups. Assume that there are a total of P groups and
that the size of each group is $n_i$ (groups need not be equal in size, so $n_i$
would indicate the sample size of the ith group), and we calculate the total
sum of squares for each group separately. We can write this as

$$\text{Total SS} = \sum_{i=1}^{P} \sum_{j=1}^{n_i} (y_{ij} - \bar{y})^2$$

Here, $y_{ij}$ identifies the jth observation from the ith group (for example,
$y_{23}$ would mean the third observation from the second group). Notice that
we haven’t changed the value; all we’ve done is specify the order in which
we’ll calculate the total sum of squares. We’ll calculate the sum of squares
in the first group, and then in the second and so forth, adding up all of the
sums of squares in each group to arrive at the total sum of squares.

Next we’ll calculate the sample average for each group, labeling it $\bar{y}_i$,
which is the sample average of the ith group. For example, in the hotel data,
the values (shown in cells D5:D8) are

NYC   481.125
CHI   227.625
DEN   181.125
SF    200.375

Using the group averages, we can calculate the total sum of squares within
each group. This is equal to the sum of the squared deviations, where the
deviation is from each observation to its group average. We’ll call this value
the error sum of squares, or SSE, and express it as

$$\text{SSE} = \sum_{i=1}^{P} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2$$

Another term for this value is the within-groups sum of squares because
it is the sum of squares within each group. The value for SSE in the hotel
data is 461,031.5 (shown in cell B14).

The final piece of the analysis of variance is to calculate the sum of
squares between each of the group averages and the overall average. This
value, called the between-groups sum of squares and otherwise known as
the treatment sum of squares, or SST, is

$$\text{SST} = \sum_{i=1}^{P} n_i (\bar{y}_i - \bar{y})^2$$

Note that we take each squared difference and multiply it by the number
of observations in the group. In this hotel data set, each group has eight
observations, so the value of $n_i$ is always eight. The between-groups sum of
squares for the hotel data is equal to 472,716.4 (cell B13).

But note that the total sum of squares is equal to the within-groups sum
of squares plus the between-groups sum of squares, because 933,747.9 =
461,031.5 + 472,716.4. In general terms,

Total SS = SSE + SST
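
The decomposition can be checked numerically on any grouped data set. The sketch below uses a small made-up data set with unequal group sizes (it is not the hotel data, and the function name is hypothetical) and computes the three sums of squares directly from their definitions.

```python
def sum_of_squares(groups):
    """Return (total_ss, sse, sst) for a dict mapping group name -> observations."""
    all_values = [y for ys in groups.values() for y in ys]
    grand_mean = sum(all_values) / len(all_values)

    # Total SS: squared deviations of every observation from the overall average.
    total_ss = sum((y - grand_mean) ** 2 for y in all_values)

    sse = 0.0  # within-groups (error) sum of squares
    sst = 0.0  # between-groups (treatment) sum of squares
    for ys in groups.values():
        group_mean = sum(ys) / len(ys)
        sse += sum((y - group_mean) ** 2 for y in ys)
        sst += len(ys) * (group_mean - grand_mean) ** 2
    return total_ss, sse, sst

# Made-up observations with unequal group sizes, for illustration only.
groups = {"A": [4.0, 6.0, 8.0], "B": [10.0, 12.0], "C": [1.0, 3.0, 5.0, 7.0]}
total_ss, sse, sst = sum_of_squares(groups)
print(total_ss, sse + sst)  # the two numbers agree up to floating-point rounding
```

The identity holds whether or not the groups are the same size, which is why the formula for SST weights each squared difference by the group's own sample size.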

Let’s try to relate this to the price of staying at hotels in various cities. If
the average prices in the various cities are very different, the between-groups
sum of squares will be a large value. However, if the city averages are close
in value, the between-groups sum of squares will be near zero. The argument
goes the other way, too; a large value for the between-groups sum of
squares could indicate that the city averages are very different, whereas a
small value might show that they are not so different.

A large value for the between-groups sum of squares could also be due to a
large number of groups, so you have to adjust for the number of groups in the
data set. The degrees of freedom (df) column in the ANOVA table (cells
C13:C16) tells you that. The df for the city factor (in this case the between-groups
term) is the number of groups minus 1, or 4 - 1 = 3 (cell C13). The degrees of
freedom for the total sum of squares is the total number of observations minus 1,
or 32 - 1 = 31 (cell C16). The remaining degrees of freedom are assigned to the
error term (the within-groups term) and are equal to 31 - 3 = 28 (cell C14).

The Mean Square (MS) column (cells D13:D14) shows the sum of squares
divided by the degrees of freedom; you can think of the entries in this
column as variances. The first value, 157,572.1 (cell D13), measures the
variance in hotel cost between the various cities; the second value, 16,465.4 (cell
D14), measures the variance of the cost within cities. The within-groups mean
square also estimates the value of $\sigma^2$, the variance of the error term
$\varepsilon$ shown in the means model earlier in the chapter. If the variability in
hotel prices between cities is large relative to the variability of hotel prices
within the cities, then we might conclude that mean hotel price is not the same
for each city.

To test this, we calculate the ratio of the two variances. Under the null
hypothesis, this value should follow an F distribution with n, m degrees of
freedom, where n is the degrees of freedom for the between-groups variance
and m is the degrees of freedom for the within-groups variance.

In the hotel data, the F value is 9.570 (cell E13) and follows an F(3,28)
distribution. Excel calculates the p value to be .000163 (cell F13), which is
less than .05. We reject the null hypothesis, accepting the alternative that
there is a difference in the mean hotel price.
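
As a cross-check on the table, the degrees of freedom, mean squares, and F value can be recomputed from numbers already quoted in the text: the four city averages, the group size of eight, and the within-groups sum of squares. The Python sketch below is an illustration under those inputs only; it does not read the Hotel workbook, and it stops short of the p value, which requires the F distribution (Excel's FDIST function).

```python
# Values quoted in the text for the hotel data (four cities, eight hotels each).
group_means = {"NYC": 481.125, "CHI": 227.625, "DEN": 181.125, "SF": 200.375}
n_per_group = 8
sse = 461031.5  # within-groups sum of squares reported in cell B14

n_groups = len(group_means)
n_total = n_groups * n_per_group

# With equal group sizes, the overall average is the plain average of the group means.
grand_mean = sum(group_means.values()) / n_groups

# Between-groups sum of squares: group size times squared deviation of each group mean.
sst = sum(n_per_group * (mean - grand_mean) ** 2 for mean in group_means.values())

df_between = n_groups - 1        # 4 - 1 = 3
df_within = n_total - n_groups   # 32 - 4 = 28
ms_between = sst / df_between
ms_within = sse / df_within
f_value = ms_between / ms_within

print(f"grand mean = {grand_mean}")
print(f"SST = {sst:.1f}, df = {df_between}, MS = {ms_between:.1f}")
print(f"SSE = {sse:.1f}, df = {df_within}, MS = {ms_within:.1f}")
print(f"F = {f_value:.3f}")
```

The recomputed overall average, between-groups sum of squares, and F ratio should agree with the values reported above up to rounding.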

Although the output does not show it, you can use the values in the ANOVA 
table to derive some of the same statistics you used in regression analysis. For 
example, the ratio of the between-groups sum of squares to the total sum of 
squares equals $R^2$, the coefficient of determination discussed in some depth 
in Chapters 8 and 9. In this case $R^2$ = 472,716.4/933,747.9 = 0.50626. Thus 
about 50% of the variability in hotel price is explained by the city of origin. 
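
As a check on the table arithmetic, the following Python sketch rebuilds the degrees of freedom, the within-groups mean square, the F ratio, and $R^2$ from the two sums of squares quoted above. The variable names are illustrative only, and the p value is left out because Python's standard library has no F distribution.

```python
# Rebuild the one-way ANOVA quantities from the sums of squares in the text.
ss_between = 472_716.4   # between-groups (city) sum of squares
ss_total = 933_747.9     # total sum of squares
n_groups = 4             # four cities
n_obs = 32               # total number of hotels

df_between = n_groups - 1            # groups minus 1
df_total = n_obs - 1                 # observations minus 1
df_within = df_total - df_between    # what is left for the error term

ss_within = ss_total - ss_between
ms_between = ss_between / df_between
ms_within = ss_within / df_within    # estimate of the error variance

f_ratio = ms_between / ms_within
r_squared = ss_between / ss_total

print(f"df: {df_between}, {df_within}, {df_total}")
print(f"MS within = {ms_within:.2f}")
print(f"F = {f_ratio:.3f}, R^2 = {r_squared:.5f}")
```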

Comparing Means 

The ANOVA table has led you to reject the hypothesis that the mean 
single-room price is the same in all four cities and to accept the alternative 
that the four means are not all the same. Looking at the mean values, you might 
be tempted to conclude that the high price for New York City hotel rooms 
is the cause and leave it at that. This assumption would be unwarranted 
because you haven’t tested for this specific hypothesis. Are there significant 
differences between the other cities as well? To find out, you need to 
calculate the differences in mean value between all pairs of cities and then 
test the differences to discover their statistical significance. 

Excel does not provide a function to test pairwise mean differences, but 
one has been provided for you with StatPlus. 

To create a matrix of paired differences: 

1 Click Multivariate Analysis from the StatPlus menu and then click 
Means Matrix. 

2 Click the Data Values button, and select Price from the list of range 
names. Click OK. 

3 Click the Categories button, and select City from the range names 
list. Click OK. 

4 Click the Use Bonferroni Correction checkbox. 

5 Click the Output button, and direct the output to a new worksheet 
named Means Matrix. Click OK. 

Figure 10-8 shows the completed dialog box. 


Figure 10-8 The Create Matrix of Mean Differences dialog box 

Click OK. Excel generates the output shown in Figure 10-9. 


Figure 10-9 Pairwise mean difference 

You can tell from the Pairwise Mean Difference table that the mean cost 
for a single hotel room in San Francisco is $27.25 less than the mean cost in 
Chicago. The largest difference is between Denver and New York City, with 
a single room in Denver hotels costing $300 less than a single room in New 
York City hotels. Note that the output includes the mean squared error value 
from the ANOVA table, 16,465.41, which is the estimate of the variance of 
hotel prices. 
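
The matrix of paired differences is simply every group mean subtracted from every other. The sketch below illustrates the idea in Python; the four city means are not read from the StatPlus output but are implied by the regression coefficients reported later in this chapter (a New York mean of 481.125, with the other cities 253.50, 300, and 280.75 lower), and the dictionary layout is an assumption made for illustration.

```python
from itertools import combinations

# Group means implied by the regression output reported later in the chapter.
means = {
    "NYC": 481.125,
    "Chicago": 481.125 - 253.50,
    "Denver": 481.125 - 300.00,
    "San Francisco": 481.125 - 280.75,
}

# Difference in means (first city minus second) for every pair of cities.
differences = {
    (a, b): means[a] - means[b] for a, b in combinations(means, 2)
}

for (a, b), diff in differences.items():
    print(f"{a} - {b}: {diff:+.2f}")

# The pair of cities whose means are farthest apart.
largest = max(differences, key=lambda pair: abs(differences[pair]))
print("Largest gap:", largest)
```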


Using the Bonferroni Correction Factor 

You also requested in the dialog box a table of p values for these mean 
differences using the Bonferroni correction factor. Recall from Chapter 8 that 
the Bonferroni procedure is a conservative method for calculating the 
probabilities by multiplying the p value by the total number of comparisons. 
Because the p values are much higher than you would see if you compared the 
cities with t tests, it is harder to get significant comparisons with the 
Bonferroni procedure. However, the Bonferroni procedure has the advantage of 
giving fewer false positives than t tests would give. 

With the Bonferroni procedure, the chance of finding at least one significant 
difference among the means is less than 5% if all four population means are 
the same. On the other hand, if you do six t tests to compare the four cities 
at the 5% level, there is much more than a 5% chance of getting significance 
in at least one of the six tests if all four population means are the same. 
Other methods are available to help you adjust the p value for multiple 
comparisons, including Tukey’s and Scheffé’s, but the Bonferroni method is the 
easiest to implement in Excel, which does not provide a correction procedure. 

Note: Essentially, the difference between the Bonferroni procedure and a 
t test is that for the Bonferroni procedure, the 5% applies to all six 
comparisons together, but for t tests, the 5% applies to each of the six 
comparisons separately. In statistical language, the Bonferroni procedure is 
testing at the 5% level experimentwise, whereas the t test is testing at the 
5% level pairwise. 
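
The correction itself is a one-line calculation. The following Python sketch uses made-up p values (not those in Figure 10-9) to show the adjustment, and it also estimates the chance of at least one false positive across six uncorrected tests. That second figure assumes the six tests are independent, which pairwise comparisons sharing the same groups are not, so treat it as a rough guide only.

```python
from math import comb

def bonferroni(p_values, n_comparisons):
    """Multiply each p value by the number of comparisons, capping at 1."""
    return [min(1.0, p * n_comparisons) for p in p_values]

n_groups = 4
n_comparisons = comb(n_groups, 2)   # number of distinct pairs of cities

# Hypothetical uncorrected p values, for illustration only.
raw = [0.0004, 0.02, 0.30]
adjusted = bonferroni(raw, n_comparisons)

# Chance of at least one false positive across the uncorrected tests at the
# 5% level, under the simplifying assumption that the tests are independent.
familywise = 1 - (1 - 0.05) ** n_comparisons

print(adjusted)
print(round(familywise, 3))
```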

The pairwise comparison probabilities show that the three biggest 
differences are significant (highlighted in red). The New York City room price 
is higher than the room price in the other three cities, but none of those 
three cities are significantly different in price from each other. 


When to Use Bonferroni 

As the size of the means matrix increases, the number of comparisons 
increases as well. Consequently, the p values for the pairwise differences are 
greatly inflated. As you can imagine, there might be a point where there are 
so many comparisons in the matrix that it is nearly impossible for any one 
of the comparisons to be statistically significant using the Bonferroni 
correction factor. Many statisticians are concerned about this problem and feel 
that although the Bonferroni correction factor does guard well against 
incorrectly finding significant differences, it is also too conservative and 
misses true differences in pairs of mean values. 
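
To see how quickly the correction bites, note that multiplying a p value by the number of comparisons and testing at 5% is the same as requiring the uncorrected p value to fall below 5% divided by that number. The sketch below tabulates this threshold for a few group counts; the function name is hypothetical and the group counts are arbitrary examples.

```python
from math import comb

def per_comparison_level(n_groups, alpha=0.05):
    """Return (number of pairwise comparisons, alpha divided by that number)."""
    m = comb(n_groups, 2)
    return m, alpha / m

for k in (4, 6, 10, 20):
    m, level = per_comparison_level(k)
    print(f"{k:>2} groups: {m:>3} comparisons, raw p must be below {level:.5f}")
```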

In such situations, statisticians make a distinction between paired 
comparisons that are planned before the data are analyzed and those that occur 
only after we look at the data. For example, the planned comparisons here are 
the differences in hotel room price between New York City and the others. 
You should be careful with new comparisons that you come up with after 
you have collected the data. You should hold these comparisons to a much 
higher standard than the comparisons you’ve planned to make all along. This 
distinction is important in order to ward off the effects of data “snooping” 
(unplanned comparisons). Some statisticians recommend that you do the 
following when analyzing the paired means differences in your analysis of 
variance: 

1. Conduct an F test for equal means. 

2. If the F statistic is significant at the 5% level, make any planned 
comparisons you want without correcting the p value. For data snooping, use a 
correction factor such as Bonferroni’s on the p value. 

3. If the F statistic for equal means is not significant, you can still consider 
any planned comparisons, but only with a correction factor to the p value. 
Do not analyze any unplanned comparisons (Milliken and Johnson 1984). 

It should be emphasized that although some statisticians embrace this 
approach, others question its validity. 

Comparing Means with a Boxplot 

Earlier you used multiple histograms to compare the distribution of hotel prices 
among the different cities. The boxplot is also very useful for this task because 
it shows the broad outline of the distributions and displays the medians for 
the four cities. Recall that if the data are very badly skewed, the mean might be 
strongly affected by outlying values. The median would not have this problem. 

To create a boxplot of price versus city: 

1 Click Single Variable Charts from the StatPlus menu and then click 
Boxplots. 

2 Click the Data Values button, and select Price from the list of range 
names. Click OK. 

3 Click the Categories button, and select City from the range names 
list. Click OK. 

4 Click the Output button, and direct the output to a new chart sheet 
named Boxplots. Click OK. 

5 Click OK. The resulting boxplots are shown in Figure 10-10. 


Figure 10-10 Boxplot of room prices for each city 



Compare the medians, indicated by the middle horizontal line (not dotted) 
in each box. The median for San Francisco is above the median for Chicago 
even though you discovered from the pairwise mean difference matrix that 
the mean price in Chicago is $27.25 above the mean in San Francisco. The 
reason for the difference is an outlier in the sample of Chicago room prices. 
This outlier has a big effect on the Chicago mean price, but not on the 
median. The median is much more robust to the effect of outliers. 
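
A small numerical illustration makes the point. The prices below are invented for this sketch and are not the Chicago sample; the last value stands in for the outlier.

```python
from statistics import mean, median

# Made-up room prices; the last value plays the role of the outlier.
prices = [150, 160, 170, 180, 190, 200, 210, 600]
without_outlier = prices[:-1]

# Compare how far each summary moves when the outlier is included.
mean_shift = mean(prices) - mean(without_outlier)
median_shift = median(prices) - median(without_outlier)

print("mean shift:", mean_shift)
print("median shift:", median_shift)
```

In this toy sample a single extreme price moves the mean by far more than it moves the median, which is the pattern the boxplot reveals for Chicago.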


One-Way Analysis of Variance and Regression 

You can think of analysis of variance as a special form of regression. In the 
case of analysis of variance, the predictor variables are discrete rather than 
continuous. Still, you can express an analysis of variance in terms of 
regression and, in doing so, can get additional insights into the data. To do 
this, you have to reformulate the model. 

Earlier in this chapter you were introduced to the means model 

$y = \mu_i + \varepsilon$ 

for the ith treatment group. An equivalent way to express this relationship 
is with the effects model 

$y = \mu + \alpha_i + \varepsilon$ 

Here $\mu$ is a mean term, $\alpha_i$ is the effect from the ith treatment 
group, and $\varepsilon$ is a normally distributed error term with mean 0 and 
variance $\sigma^2$. 

Let’s apply this equation to the hotel data. In this data set there are four 
groups representing the four cities, so you would expect the effects model 
to have a mean term $\mu$ and four effect terms $\alpha_1$, $\alpha_2$, 
$\alpha_3$, and $\alpha_4$ representing the four cities. There is a problem, 
however: You have five parameters in your model, but you are estimating only 
four mean values. This is an example of an overparametrized model, where you 
have more parameters than response values. As a result, an infinite number of 
possible values for the parameters will solve the equation. To correct this 
problem, you have to reduce the number of parameters. Statistical packages 
generally do this in one of two ways: Either they constrain the values of the 
effect terms so that the sum of the terms is zero, or they define one of the 
effect terms to be zero (Milliken and Johnson, 1984). Let’s apply this second 
approach to the hotel data and perform the analysis of variance using 
regression modeling. 


Indicator Variables 

To perform the analysis of variance using regression modeling, you can 
create indicator variables for the data. Indicator variables take on values of 
either 1 or 0, depending on whether the data belong to a certain treatment 
group or not. For example, you can create an indicator variable where the 
variable value is 1 if the observation comes from a hotel in San Francisco or 
0 if the observation comes from a hotel not in San Francisco. 

You’ll use the indicator variables to represent the terms in the effects 
model. 


To create indicator variables for the hotel data: 

1 Click the Hotel worksheet tab (you might have to scroll to see it) to 
return to the worksheet containing the hotel data. 

2 Click Manipulate Columns from the StatPlus menu and then click 
Create Indicator Columns. 

3 Click the Categories button, and select City from the list of range 
names. Click OK. 

4 Click the Output button, click the Cell option button, and select cell 
F1. Click OK. 

5 Click OK. 

Excel generates the four new columns shown in Figure 10-11. 


Figure 10-11 Indicator variables in columns F:I 

The values in column F, labeled I (“NYC”), are equal to 1 if the values in 
the row come from a hotel in New York City or 0 if they do not. Similarly, 
the values for the next three columns are 1 if the observations come from 
Chicago, Denver, and San Francisco, respectively, or 0 otherwise. 
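
The same construction can be sketched outside Excel. The Python function below is a hypothetical stand-in for the StatPlus command, not its actual implementation, and the short list of city labels is invented for illustration.

```python
def indicator_columns(categories):
    """Return {level: [0/1, ...]} with one indicator column per distinct level."""
    levels = list(dict.fromkeys(categories))   # distinct levels, first-seen order
    return {
        level: [1 if value == level else 0 for value in categories]
        for level in levels
    }

# A few hypothetical rows of the City column, for illustration only.
city = ["NYC", "NYC", "CHI", "DEN", "SF", "CHI"]
columns = indicator_columns(city)

for level, column in columns.items():
    print(f'I("{level}")', column)
```

Each row has exactly one indicator equal to 1, which is why the four columns together carry the same information as the single City column.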

Fitting the Effects Model 


With these columns of indicator variables, you can now fit the effects model 
to the hotel pricing data. 

To fit the effects model using regression analysis: 

1 Click the Data Analysis button from the Analysis group on the 
Data tab, then click Regression in the Analysis Tools list box, and 
click OK. 

2 Type D1:D33 in the Input Y Range text box, press [Tab], and then 
type G1:I33 in the Input X Range text box. 

Recall that you have to remove one of the effect terms to keep from 
overparametrizing the model. For this example, remove the New 
York effect term. (You could have removed any one of the four city 
effect terms.) 

3 Click the Labels checkbox to select it because the range includes a 
header row. 

4 Click the New Worksheet Ply option button; then type Effects Model 
in the corresponding text box. 

5 Verify that all four Residuals checkboxes are deselected; then 
click OK. 

The regression output appears as in Figure 10-12. (The columns are 
resized to show the labels.) 


Figure 10-12 Created effects model with the Regression command 

The analysis of variance table produced by the regression (cells A10:F14) 
and shown in Figure 10-12 should appear familiar to you because it is 
equivalent to the ANOVA table created earlier and shown in Figure 10-7. There 
are two differences: the Between Groups row from the earlier ANOVA table is 
the Regression row in this table, and the Within Groups row is now termed 
the Residual row. 

The parameter values of the regression are also familiar. The intercept 
coefficient 481.125 (cell B17) is the same as the mean price in New York. The 
values of the CHI, DEN, and SF effect terms now represent the difference 
between the mean hotel price in these cities and the price in New York. Note 
that this is exactly what you calculated in the matrix of paired mean 
differences shown in Figure 10-9. The p values for these coefficients are the 
uncorrected p values for comparing the paired mean differences between these 
cities and New York. If you multiplied these p values by 6 (the number of 
paired comparisons in the paired mean differences matrix), you would have 
the same p values shown in Figure 10-9. 

Can you see how the use of indicator variables allowed you to create the 
effects model? Consider the values for I (“CHI”). For any non-Chicago hotel, 
the value of the indicator variable is 0, so the effect term is multiplied by 0, 
and therefore has no impact on the estimate of the hotel price. It is only for 
Chicago hotels that the effect term is present. 

As you can see, using regression analysis to fit the effects model gives you 
much of the same information as the one-way analysis of variance. 

The model you’ve considered suggests that the average price for a single 
room at a hotel in New York City is significantly higher than that for a 
single room in a hotel in Chicago, Denver, or San Francisco. You can expect 
to pay about an average of $481 for a single room in New York City, and 
$253.50 less than this in Chicago, $300 less in Denver, and $280.75 less 
in San Francisco. You’ve completed your study of the hotel data in this 
workbook. You can close the Hotel ANOVA workbook now, saving your 
changes. 
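
The way the indicator variables switch each effect term on or off can be sketched in a few lines of Python. The coefficients are the ones quoted above; the function and dictionary names are illustrative and not part of Excel’s output.

```python
# Coefficients quoted in the text; New York City is the baseline (effect 0).
intercept = 481.125
effects = {"CHI": -253.50, "DEN": -300.00, "SF": -280.75}

def predicted_price(city):
    """Fitted mean price: intercept plus the effect whose indicator equals 1."""
    indicators = {name: 1 if name == city else 0 for name in effects}
    return intercept + sum(effects[name] * indicators[name] for name in effects)

for city in ("NYC", "CHI", "DEN", "SF"):
    print(city, predicted_price(city))
```

For a New York hotel every indicator is 0, so the fitted value is the intercept alone; for any other city exactly one effect term survives.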

EXCEL TIPS 

• You can use the Regression command to calculate the means 
model instead of the effects model. To do this, run the Analysis 
ToolPak’s Regression command, choose all of the indicator 
variables in the Input X Range text box, and select the Constant 
Is Zero checkbox. This will remove the constant term from the 
model. The parameter estimates will correspond to mean values 
of the different groups. 

Two-Way Analysis of Variance 

One-way analysis of variance compares several groups corresponding to a 
single categorical variable, or factor. A two-way analysis of variance uses 
two factors. In agriculture, for example, you might be interested in the 
effects of both potassium and nitrogen on the growth of potatoes. In medicine 
you might want to study the effects of medication and dose on the duration 
of headaches. In education you might want to study the effects of grade level 
and gender on the time required to learn a skill. A marketing experiment 
might consider the effects of advertising dollars and advertising medium 
(television, magazines, and so on) on sales. 

Recall that earlier in the chapter you looked at the means model for a 
one-way analysis of variance. Two-way analysis of variance can also be 
expressed as a means model: 

$y_{ijk} = \mu_{ij} + \varepsilon_{ijk}$ 

where $y$ is the response variable and $\mu_{ij}$ is the mean for the ith level 
of one factor and the jth level of the second factor. Within each combination 
of the two factors, you might have multiple observations called replicates. 
Here $\varepsilon_{ijk}$ is the error for the ith level of the first factor, the 
jth level of the second factor, and the kth replicate, following a normal 
distribution with mean 0 and variance $\sigma^2$. 

The model is more commonly presented as an effects model where 

$y_{ijk} = \mu + \alpha_i + \beta_j + \alpha\beta_{ij} + \varepsilon_{ijk}$ 

Here $y$ is the response variable, $\mu$ is the overall mean, $\alpha_i$ is the 
effect of the ith treatment for the first factor, and $\beta_j$ is the effect of 
the jth treatment for the second factor. The term $\alpha\beta_{ij}$ represents 
the interaction between the two factors, that is, the effect that the two 
factors have on each other. For example, in an experiment where the two factors 
are advertising dollars and advertising medium, the increase in sales from 
additional advertising dollars might be the same regardless of what advertising 
medium (radio, newspaper, or television) is used, or it might vary depending on 
the medium. When the increase is the same regardless of the medium, the 
interaction is 0; otherwise, there is an interaction between advertising 
dollars and medium. 


A Two-Factor Example 

To see how different factors affect the value of a response variable, 
consider an example of the effects of four different assembly lines (A, B, C, 
or D) and two shifts (a.m. or p.m.) on the production of microwave ovens 
for an appliance manufacturer. Assembly line and shift are the two factors; 
the assembly line factor has four levels, and the shift factor has two levels. 
Each combination of the factors line and shift is called a cell, so there are 
4 × 2 = 8 cells. The response variable is the total number of microwaves 
assembled in a week for one assembly line operating on one particular shift. 
For each of the eight combinations of assembly line and shift, six separate 
weeks’ worth of data are collected. 

You can describe the mean number of microwaves created per week with 
the effects model where 

Mean number of microwaves = overall mean + assembly line effect 
                            + shift effect + interaction + error 

Now let’s examine a possible model of how the mean number of microwaves 
produced could vary between shifts and assembly lines. Let the overall mean 
number of microwaves produced for all shifts and assembly lines be 240 per 
week. Now let the four assembly line effects be A, +66 (that is, assembly 
line A produces on average 66 more microwaves than the overall mean); B, -2; 
C, -100; and D, +36. Let the two shift effects be p.m., -6, and a.m., +6. 
Notice that the four assembly line effects add up to zero, as do the two 
shift effects. This follows from the need to constrain the values of the 
effect terms to avoid overparametrization, as was discussed with the one-way 
effects model earlier in this chapter. 

If you exclude the interaction term from the model, the population cell 
means (the mean number of microwaves produced) look like this. 

          A     B     C     D 
p.m.    300   232   134   270 
a.m.    312   244   146   282 

These values are obtained by adding the overall mean + the assembly line 
effect + the shift effect for each of the eight cells. For example, the mean for 
the p.m. shift on assembly line A is 

Overall mean + assembly line effect + shift effect = 240 + 66 - 6 = 300 

Without interaction, the difference between the a.m. and the p.m. shifts 
is the same (12) for each assembly line. You can say that the difference 
between a.m. and p.m. is 12 no matter which assembly line you are talking 
about. This works the other way, too. For example, the difference between 
line A and line C is the same (166) for both the p.m. shift (300 - 134) and 
the a.m. shift (312 - 146). You might understand these relationships better 
from a graph. Figure 10-13 shows a plot of the eight means with no 
interaction (you don’t have to produce this plot). 
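
The additive table can be reproduced directly from the overall mean and the two sets of effects given in the text. This Python sketch is only an illustration of that arithmetic; the dictionary names are assumptions.

```python
overall_mean = 240
line_effect = {"A": 66, "B": -2, "C": -100, "D": 36}
shift_effect = {"p.m.": -6, "a.m.": 6}

# Both sets of effects are constrained to sum to zero.
assert sum(line_effect.values()) == 0
assert sum(shift_effect.values()) == 0

# Without interaction, each cell mean is just the sum of the three terms.
cell_mean = {
    (shift, line): overall_mean + line_effect[line] + shift_effect[shift]
    for shift in shift_effect
    for line in line_effect
}

for shift in shift_effect:
    print(shift, [cell_mean[(shift, line)] for line in line_effect])

# The a.m. minus p.m. gap is then the same on every assembly line.
gaps = {line: cell_mean[("a.m.", line)] - cell_mean[("p.m.", line)]
        for line in line_effect}
print(gaps)
```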

Figure 10-13 Means plot without an interaction effect (vertical axis: Mean 
Number of Microwaves Produced) 


The cell means are plotted against the assembly line factor using separate 
lines for the shift factor. This is called an interaction plot; you’ll create one 
later in this chapter. 

Because there is a constant spacing of 12 between the two shifts, the lines 
are parallel. The pattern of ups and downs for the p.m. shift is the same as 
the pattern of ups and downs for the a.m. shift. 

What if interaction is allowed? Suppose that the eight cell population 
means are as follows: 

          A     B     C     D 
p.m.    295   235   175   200 
a.m.    317   241   142   220 

In this situation, the difference between the shifts varies from assembly 
line to assembly line, as shown in Figure 10-14. This means that any 
inference on the shift effect must take into account the assembly line. You 
might claim that the a.m. shift generally produces more microwaves, but this 
is not true for assembly line C. 
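
Computing the shift gap for each assembly line from the table above makes the interaction visible without a plot. The following Python sketch uses the eight cell means from the text; the variable names are illustrative.

```python
lines = ["A", "B", "C", "D"]
pm = [295, 235, 175, 200]
am = [317, 241, 142, 220]

# a.m. minus p.m. difference on each assembly line.
gap = {line: a - p for line, a, p in zip(lines, am, pm)}
print(gap)

# Parallel lines on the means plot would require one common gap.
has_interaction = len(set(gap.values())) > 1
print("interaction present:", has_interaction)
```

The gap changes from line to line and even changes sign on line C, which is what the crossing lines in the means plot show.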


Figure 10-14 Means plot with an interaction effect 

The assumptions for two-way ANOVA are essentially the same as those for 
one-way ANOVA. For one-way ANOVA, all observations on a treatment were 
assumed to have the same mean, but here all observations in a cell are assumed 
to have the same mean. The two-way ANOVA assumes independence, constant 
variance, and normality, just as the one-way ANOVA (and regression). 


Two-Way Analysis Example: Comparing Soft Drinks 

The Cola workbook contains data describing the effects of cola (Coke, Pepsi, 
Shasta, or generic) and type (diet or regular) on the foam volume of cola soft 
drinks. Cola and type are the factors; cola has four levels, and type has two 
levels. There are, therefore, eight combinations, or cells, of cola brand and 
soft drink type. For each of the eight combinations, the experimenter 
purchased and cooled a six-pack, so there are 48 different cans of soda. Then 
the experimenter chose a can at random, poured it in a standard way into a 
standard glass, and measured the volume of foam. 

Why would it be wrong to test all of the regular Coke first, then the diet 
Coke, and so on? Although the experimenter might make every effort to keep 
everything standardized, trends that influence the outcome could appear. 
For example, the temperature in the room or the conditions in the 
refrigerator might change during the experiment. There could be subtle trends 
in the way the experimenter poured and measured the cola. If there were such 
trends, it would make a difference which brand was poured first, so it is best 
to pour the 48 cans in random order. 
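
One way to produce such a random pouring order is to list all 48 cans and shuffle the list. The sketch below is an illustration only, not the procedure the experimenter actually used; the seed is arbitrary and is there just so the order can be reproduced.

```python
import random

brands = ["Coke", "Pepsi", "Shasta", "Generic"]
types = ["Regular", "Diet"]

# One entry per can: 4 brands x 2 types x 6 cans = 48 cans.
cans = [(brand, kind, can_no)
        for brand in brands
        for kind in types
        for can_no in range(1, 7)]

rng = random.Random(2024)      # arbitrary seed so the order can be reproduced
pour_order = cans[:]           # copy, leaving the original list untouched
rng.shuffle(pour_order)

print(pour_order[:5])
```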

The Cola workbook contains the variables shown in Table 10-2. 

Table 10-2 Data for Cola Workbook 

Range Name   Range     Description 
Can_No       A2:A49    The number of the can (1-6) in the six-pack 
Cola         B2:B49    The cola brand 
Type         C2:C49    Type of cola: regular or diet 
Foam         D2:D49    The foam content of the cola 
Cola_Type    E2:E49    The brand and type of the cola 


To open the Cola workbook: 

1 Open the Cola workbook from the Chapter10 data folder. 

2 Save the workbook as Cola ANOVA. The workbook appears as 
shown in Figure 10-15. 

Figure 10-15 Cola workbook 


Graphing the Data to Verify Assumptions 


Before performing a two-way analysis of variance on the data, you should 
plot the data values to see whether there are any major violations of the 
assumptions of equal variability in the different cells. Note that you can use 
the Cola_Type variable to identify the eight cells. 


To create multiple histograms of the foam data: 

1 Click Multi-variable Charts from the StatPlus menu and then click 
Multiple Histograms. 

2 Click the Data Values button, select Foam from the range names list, 
and click OK. 

3 Click the Categories button, select Cola_Type from the range names 
list, and click OK. 

4 Click the Display normal curve checkbox. 

5 Click the Output button, send the charts to a new worksheet named 
Histograms, and click OK. 

6 Click OK to generate the histograms shown in Figure 10-16. 

Figure 10-16 Multiple histograms of cola type 



Because of the number of charts, you must either reduce the zoom factor 
on your worksheet or scroll vertically through the worksheet to see all the 
plots. Do you see major differences in spread among the eight groups? If so, 
it would suggest a violation of the equal-variances assumption, because all 
of the groups are supposed to have the same population variance. The 
histograms seem to indicate a greater variability in the generic colas and the 
Shasta brand, whereas less variability is indicated for the Coke and Pepsi 
brands. Once again, the two-way ANOVA is fairly robust with respect to the 
constant variance assumption, so this might not invalidate the analysis. 

You should also look for outliers because extreme observations can make 
a big difference in the results. An outlier could be the result of a strange can 
of cola, a wrong observation, a recording error, or an error in entering the 
data. To gain further insight into the distribution of the data, create a 
boxplot of each of the eight combinations of brand and type. 

To create boxplots of the foam data: 

1 Click Single Variable Charts from the StatPlus menu and then click 
Boxplots. 

2 Click the Data Values button, select Foam from the range names list, 
and click OK. 

3 Click the Categories button, select Cola_Type from the range names 
list, and click OK. 

4 Click the Output button, and send the output to a new chart sheet 
named Boxplots. Click OK. 

5 Click OK to create the boxplots. 

6 You can improve the chart by editing the labels at the bottom of the 
boxplot, removing the text string Cola_Type= from each label, and 
increasing the font size. See Figure 10-17. 


Figure 10-17 Boxplots of foam versus cola type 



From the boxplots, you can see that there are no extreme outliers evident 
in the data, but there are several moderate outliers; perhaps most noteworthy 
are the outliers for regular Pepsi and regular Shasta. An advantage of the 
boxplot over the multiple histograms is that it is easier to view the relative 
change in foam volume from diet to regular for each brand of cola. The first 
two boxplots represent the range of foam values for regular and diet Coke, 
respectively, after which come the Pepsi values, Shasta values, and, finally, 
the generic values. Notice in the plot that the same pattern occurs for both 
the diet and the regular colas. Coke is the highest, Pepsi is the lowest, and 
Shasta and generic are in the middle. The difference in the foam between 
the diet and the regular sodas does not depend much on the cola brand. 
This suggests that there is no interaction between the cola effect and the 
type effect. 

On the basis of this plot, can you draw preliminary conclusions regarding 
the effect of type (diet or regular) on foam volume? Does it appear that 
there is much difference due to cola type (diet or regular)? Because the foam 
levels do not appear to differ much between the two types, you can expect 
that the test for the type effect in a two-way analysis of variance will not be 
significant. However, look at the differences among the four brands of colas. 
The foamiest can of Pepsi is below the least foamy can of Coke, so you might 
expect that there will be a significant cola effect. 


The Interaction Plot 

The histograms and boxplots give us an idea of the influence of cola type 
and cola brand on foam volume. How do we graphically examine the 
interaction between the two factors? We can do so by creating an interaction 
plot, which displays the average foam volume for each combination of 
factors. To do this, we take advantage of Excel’s pivot table feature. 

To set up the pivot table: 

1 Click the Cola Data sheet tab to return to the data. 

2 Click the PivotTable button from the Tables group of the Insert tab. 

3 Verify that the New Worksheet option button is selected and then 
click the OK button. 

Excel opens a new worksheet displaying the fields from the list in a 
PivotTable Field list pane. 

4 Drag the Type field from the field list and drop it in the Column 
Labels box. 

5 Drag the Cola field from the list of fields and drop it into the Row 
Labels box. 

6 Drag the Foam field from the field list and drop it in the Values box. 

7 Click Sum of Foam in the Values box and select Value Field Settings 
from the pop-up menu. 

8 Click Average in the Summarize Value field by List box then 
click OK. 

9 Click the Grand Totals button from the Layout group on the Design 
tab of the PivotTable Tools ribbon and select Off for Rows and 
Columns to turn off the grand totals for the rows and columns of the 
PivotTable. 
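
What the pivot table computes is an average of the response within each combination of the two factors. The Python sketch below shows the same grouping on a handful of invented records; the foam values are hypothetical and are not taken from the Cola workbook.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (cola, type, foam) records; the real values live in the workbook.
records = [
    ("Coke", "Diet", 300.0), ("Coke", "Diet", 320.0),
    ("Coke", "Regular", 280.0), ("Coke", "Regular", 300.0),
    ("Pepsi", "Diet", 100.0), ("Pepsi", "Diet", 120.0),
    ("Pepsi", "Regular", 90.0), ("Pepsi", "Regular", 110.0),
]

# Group the foam values by (row label, column label), then average each cell.
cells = defaultdict(list)
for cola, kind, foam in records:
    cells[(cola, kind)].append(foam)

cell_average = {key: mean(values) for key, values in cells.items()}

# Print one row per cola brand, with a column for each type.
for cola in sorted({c for c, _ in cell_average}):
    row = {kind: cell_average[(cola, kind)] for kind in ("Diet", "Regular")}
    print(cola, row)
```

Plotting each column of these averages as a line against the row labels gives the interaction plot.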

Figure 10-18 Interaction plot of cola type versus cola brand 


Next you can create a line chart based on the values in the PivotTable. 

To create a line chart of the cell average: 

1 Click the PivotChart button from the Tools group on the Options tab 
of the PivotTable Tools ribbon. 

2 Click Line from the list of chart types. 

3 Click the first chart sub type in the list and click OK. 

Excel creates the PivotChart of the PivotTable data as a line chart. 

4 With the PivotChart selected, click the Move Chart button from the 
Location group on the Design tab of the PivotChart Tools ribbon and 
move the chart to a new chart sheet named Interaction Plot. 

Figure 10-18 displays the chart. 



The plot shows that the foam volumes of diet and regular colas are very 
close, except for Coke. If there is no interaction between cola brand and 
cola type, the difference in foam volume for diet and regular should be the 
same for each cola brand. This means that the lines would move in parallel, 
always with the same vertical distance. Of course, there is a certain amount 
of random variation, so the lines will usually not be perfectly parallel. The 
plot would seem to indicate that there is no interaction between cola brand 
and cola type. To confirm our visual impression, we’ll have to perform a 
two-way analysis of variance. 


Using Excel to Perform a Two-Way Analysis of Variance 

The Analysis ToolPak provides two versions of the two-way analysis of 
variance. One is for situations in which there is no replication of each 
combination of factor levels. That would be the case in this example if the 
experimenter had tested only one can of soda for each cola brand and type. 
However, the experiment has been done with six cans, so you should 
perform a two-way analysis of variance with replication. 

Note that the number of cans for each cell of brand and type must be the 
same. Specifically, you cannot use data that have five cans of diet Coke and 
six cans of regular Coke. Data with the same number of replications per cell 
are called balanced data. If the number of replicates is different for different 
combinations of brand and type, you cannot use the Analysis ToolPak’s 
two-way analysis of variance command. 
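
Checking for balance amounts to counting the observations in each cell and confirming that the counts agree. The Python sketch below is a hypothetical helper, not a feature of Excel or StatPlus, and the labels are invented.

```python
from collections import Counter

def is_balanced(labels):
    """True if every (brand, type) cell has the same number of observations."""
    counts = Counter(labels)
    return len(set(counts.values())) == 1

# Hypothetical cell labels: six cans in every cell, then one can removed.
balanced = [(b, t) for b in ("Coke", "Pepsi") for t in ("Diet", "Regular")
            for _ in range(6)]
unbalanced = balanced[1:]   # drops one can of diet Coke

print(is_balanced(balanced), is_balanced(unbalanced))
```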

Finally, to use the Analysis ToolPak on this data set, it must be organized 
in a two-way table. Figure 10-19 shows this table for the cola data. The data 
are formatted so that the first factor (the four cola brands) is displayed in the 
columns, and the second factor (diet or regular) is shown in the rows of the 
table. Replications (the six cans in each pack) occupy six successive rows. 
Each cell in the two-way table is the value of the foam volume for a 
particular can. You can create this table using the Create Two-Way Table 
command included with StatPlus. 


Figure 10-19 Two-way table of foam values

To create a two-way table:

1 Return to the Cola Data worksheet. 

2 Click Manipulate Columns from the StatPlus menu and then click
Create Two-Way Table.

3 Click the Data Values button, select Foam from the range name list,
and click OK.

4 Click the Column Levels button, select Cola from the list of range 
names, and click OK. 

5 Click the Row Levels button, and select Type from the range names. 
Click OK. 

6 Click the Output button, and direct the output to a new worksheet 
named Two-Way Table. Figure 10-20 shows the completed dialog box. 


Figure 10-20 The Make a Two-way Table dialog box



7 Click OK. 


The structure of the data on the Two-Way Table worksheet now resembles 
Figure 10-19, and you can now use the Analysis ToolPak to compute the 
two-way ANOVA. 
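
The layout that results can also be sketched in a few lines of Python. This is only an illustration of the table structure described above, not a description of how StatPlus works internally. The records and the function name are hypothetical, and the sketch assumes balanced data.

```python
from collections import defaultdict

# Hypothetical long-format records: (column factor, row factor, value).
# The numbers are invented for illustration only.
records = [
    ("Coke", "diet", 310.0), ("Pepsi", "diet", 150.3),
    ("Coke", "diet", 295.5), ("Pepsi", "diet", 162.8),
    ("Coke", "regular", 320.0), ("Pepsi", "regular", 148.9),
    ("Coke", "regular", 301.2), ("Pepsi", "regular", 155.0),
]

def two_way_table(rows, col_levels, row_levels):
    """Lay out balanced data with one column per column level and the
    replicates of each row level in successive rows."""
    cells = defaultdict(list)
    for col, row, value in rows:
        cells[(row, col)].append(value)

    table = [[""] + list(col_levels)]              # header row
    for row in row_levels:
        replicates = len(cells[(row, col_levels[0])])
        for i in range(replicates):
            label = row if i == 0 else ""          # label the first replicate only
            table.append([label] + [cells[(row, col)][i] for col in col_levels])
    return table

table = two_way_table(records, ["Coke", "Pepsi"], ["diet", "regular"])
for line in table:
    print(line)
```

Labeling only the first replicate of each row level is a layout choice made for this sketch.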


To calculate the two-way analysis of variance:

1 Click the Data Analysis button from the Analysis group on the Data
tab, then click Anova: Two-Factor With Replication in the Analysis
Tools list box, and then click OK.

2 Type A2:E14 in the Input Range text box, and then press [Tab].

You have to indicate the number of replicates in the two-way table
for this command.

3 Type 6 in the Rows per sample text box. 

4 Click the New Worksheet Ply option button, and type Two-Way 
ANOVA in the corresponding text box. 

Your dialog box should look like Figure 10-21. 


Figure 10-21 Anova: Two-Factor With Replication dialog box



5 Click OK. 


EXCEL TIPS

• If there is only one observation for each combination of the two
factors, use Anova: Two-Factor Without Replication.

• If there is more than one observation for each combination of the
two factors, use Anova: Two-Factor With Replication.

• If there are blanks for one or more of the factor combinations,
you cannot use the Analysis ToolPak to perform two-way
ANOVA.

• You can calculate the p value for the F distribution using Excel’s
FDIST(F, df1, df2), where F is the value of the F statistic, df1 is
the degrees of freedom for the factor, and df2 is the degrees of
freedom for the error term.
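
If you want to check an FDIST result outside Excel, the upper-tail probability of the F distribution can be written in terms of the regularized incomplete beta function. The Python sketch below is one possible implementation that uses a standard continued-fraction evaluation. The function names are hypothetical, and the code is not a description of how Excel computes FDIST.

```python
import math

def _beta_continued_fraction(a, b, x, max_iter=300, eps=1e-14):
    """Continued fraction for the incomplete beta function (modified Lentz)."""
    tiny = 1e-300
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        # Even step of the recurrence
        aa = m * (b - m) * x / ((qam + m2) * (a + m2))
        d = 1.0 + aa * d
        d = 1.0 / (d if abs(d) > tiny else tiny)
        c = 1.0 + aa / c
        c = c if abs(c) > tiny else tiny
        h *= d * c
        # Odd step of the recurrence
        aa = -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))
        d = 1.0 + aa * d
        d = 1.0 / (d if abs(d) > tiny else tiny)
        c = 1.0 + aa / c
        c = c if abs(c) > tiny else tiny
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h

def regularized_beta(a, b, x):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    log_front = (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                 + a * math.log(x) + b * math.log1p(-x))
    front = math.exp(log_front)
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _beta_continued_fraction(a, b, x) / a
    return 1.0 - front * _beta_continued_fraction(b, a, 1.0 - x) / b

def f_upper_tail(f, df1, df2):
    """P(F > f) for an F distribution with df1 and df2 degrees of freedom."""
    if f <= 0.0:
        return 1.0
    x = df2 / (df2 + df1 * f)
    return regularized_beta(df2 / 2.0, df1 / 2.0, x)

# Example call: F statistic 3.0 with 2 and 10 degrees of freedom.
print(f_upper_tail(3.0, 2, 10))
```

With 2 numerator degrees of freedom the tail probability has the closed form (1 + 2F/df2) raised to the power -df2/2, which gives a convenient check on the sketch.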
Figure 10-22 Two-way ANOVA table (callouts: type effect (diet or regular);
cola effect (Coke, Pepsi, Shasta, or generic); interaction of the type and
cola effect)


Interpreting the Analysis of Variance Table

The Analysis of Variance table appears as in Figure 10-22, with the columns
resized to show the labels (you might have to scroll to see this part of the
output).

There are three effects now, whereas the one-way analysis had just one.
The three effects are Sample for the type effect (row 25), Columns for the
cola effect (row 26), and Interaction for the interaction between type and
cola (row 27). The Within row (row 28) displays the within sum of squares,
also known as the error sum of squares.

As we saw earlier with the one-way ANOVA, the two-way ANOVA breaks
the total sum of squares into different parts. If we designate SST as the
sum of squares for the cola type, SSC as the sum of squares for cola brand,
SSI for the interaction between brand and type, and SSE for random
error, then


Total = SST + SSC + SSI + SSE


In this data set, the values for the various sums of squares are


SST 1,880.00

SSC 183,750.50

SSI 4,903.38

SSE 73,572.58
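
To see where the four sums of squares come from, the decomposition can be worked through on a very small balanced table. The Python sketch below uses the usual formulas for a balanced two-way design with replication. The data and the function names are hypothetical and are much smaller than the cola data.

```python
# Hypothetical balanced data: data[row_level][col_level] -> list of replicates.
data = {
    "diet":    {"A": [10.0, 12.0], "B": [20.0, 22.0]},
    "regular": {"A": [11.0, 15.0], "B": [25.0, 21.0]},
}

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def two_way_sums_of_squares(table):
    """Return (SS rows, SS columns, SS interaction, SS error, SS total)."""
    rows = list(table)
    cols = list(table[rows[0]])
    r = len(table[rows[0]][cols[0]])               # replicates per cell
    everything = [v for a in rows for b in cols for v in table[a][b]]
    grand = mean(everything)
    row_mean = {a: mean(v for b in cols for v in table[a][b]) for a in rows}
    col_mean = {b: mean(v for a in rows for v in table[a][b]) for b in cols}
    cell_mean = {(a, b): mean(table[a][b]) for a in rows for b in cols}

    ss_rows = r * len(cols) * sum((row_mean[a] - grand) ** 2 for a in rows)
    ss_cols = r * len(rows) * sum((col_mean[b] - grand) ** 2 for b in cols)
    ss_inter = r * sum(
        (cell_mean[a, b] - row_mean[a] - col_mean[b] + grand) ** 2
        for a in rows for b in cols
    )
    ss_error = sum(
        (v - cell_mean[a, b]) ** 2
        for a in rows for b in cols for v in table[a][b]
    )
    ss_total = sum((v - grand) ** 2 for v in everything)
    return ss_rows, ss_cols, ss_inter, ss_error, ss_total

ss_rows, ss_cols, ss_inter, ss_error, ss_total = two_way_sums_of_squares(data)
print(ss_rows, ss_cols, ss_inter, ss_error, ss_total)
```

In this made-up table the row, column, interaction, and error pieces add up to the total sum of squares, just as SST, SSC, SSI, and SSE do for the cola data.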


The degrees of freedom for each factor are equal to the number of levels
in the factor minus 1. There are two cola types, diet and regular, so the
degrees of freedom are 1. There are 3 degrees of freedom in the four cola brands
(Coke, Pepsi, Shasta, and generic). The degrees of freedom for the interaction
term are equal to the product of the degrees of freedom for the two factors. In
this case, that would be 1 × 3 = 3. Finally, there are n − 1, or 47, degrees of
freedom for the total sum of squares, leaving 47 − (1 + 3 + 3) = 40 degrees
of freedom for the error sum of squares. Note that the total degrees of freedom
are equal to the sum of the degrees of freedom for each term in the model. In
other words, if DFT are the degrees of freedom for the cola type, DFC are the
degrees of freedom for cola brand, DFI are the interaction degrees of freedom,
and DFE are the degrees of freedom for the error term, then

Total degrees of freedom = DFT + DFC + DFI + DFE 
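
The same bookkeeping can be written as a short Python sketch. The function name and dictionary keys are hypothetical; the inputs are the design described in the text (two cola types, four cola brands, six cans per cell).

```python
def two_way_df(row_levels, col_levels, replicates):
    """Degrees of freedom for a balanced two-way design with interaction."""
    n = row_levels * col_levels * replicates       # total number of observations
    df_rows = row_levels - 1
    df_cols = col_levels - 1
    df_interaction = df_rows * df_cols
    df_total = n - 1
    df_error = df_total - (df_rows + df_cols + df_interaction)
    return {
        "rows": df_rows,
        "columns": df_cols,
        "interaction": df_interaction,
        "error": df_error,
        "total": df_total,
    }

cola_df = two_way_df(row_levels=2, col_levels=4, replicates=6)
print(cola_df)
```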

The next column of the two-way ANOVA table displays the mean square
of each of the factors (equal to the sum of squares divided by the degrees of
freedom). These are


Type 1,880.00 

Cola 61,250.17 

Interaction 1,634.46 

Error 1,839.31 

These values are the variances in foam volume within the various factors.
The largest variance is displayed in the cola factor; this indicates that this is
where the greatest difference in foam volume lies. The mean square value for
the error term, 1839.31, is an estimate of σ², the variance in foam volume after
accounting for the factors of cola brand, type, and the interaction between
the two. In other words, after accounting for these effects in your model,
the typical deviation (or standard deviation) in foam volume is about
√1840 = 42.9.
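
As a quick arithmetic check, the mean squares and the residual standard deviation can be recomputed from the sums of squares and degrees of freedom given above. The Python sketch below uses only those published values; the variable names are hypothetical.

```python
import math

# Sums of squares and degrees of freedom as reported in the text.
ss = {"Type": 1880.00, "Cola": 183750.50, "Interaction": 4903.38, "Error": 73572.58}
df = {"Type": 1, "Cola": 3, "Interaction": 3, "Error": 40}

# Mean square = sum of squares / degrees of freedom.
mean_square = {term: ss[term] / df[term] for term in ss}

# The square root of the error mean square estimates the residual
# standard deviation of foam volume.
residual_sd = math.sqrt(mean_square["Error"])

for term, ms in mean_square.items():
    print(f"{term:<12}{ms:>12,.2f}")
print(f"Residual standard deviation: {residual_sd:.1f}")
```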

As with one-way ANOVA, the next column of the table displays the ratio of
each mean square to the mean square of the error term. These ratios follow an
F(m, n) distribution, where m is the degrees of freedom of the factor (type, cola,
or interaction) and n is the degrees of freedom of the error term. By comparing
these values to the F distribution, Excel calculates the p values (cells F25:F27)
for each of the three effects in the model. Examine first the interaction p value,
which is .455 (cell F27), much greater than .05 and not even close to indicating
significance at the 5% level. This confirms what we suspected from viewing
the interaction plot. Now let’s look at the type and cola factors.

The column or cola effect is highly significant, with a p value of 5.84 ×
10⁻¹¹ (cell F26). This is less than .05, so there is a significant difference
among colas at the 5% level (because the p value is less than .001, there
is significance at the 0.1% level, too). However, the p value is .318 for the
sample or type effect (cell F25), so there is no significant difference between
diet and regular.
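
The F ratios behind these p values are simply each mean square divided by the error mean square. The Python sketch below recomputes them from the mean squares listed earlier; the variable names are hypothetical, and the p values themselves are left to Excel.

```python
# Mean squares as reported in the ANOVA table.
mean_square = {"Type": 1880.00, "Cola": 61250.17, "Interaction": 1634.46}
ms_error = 1839.31

# Degrees of freedom: m for each effect, n for the error term.
effect_df = {"Type": 1, "Cola": 3, "Interaction": 3}
error_df = 40

# Each F ratio compares an effect's mean square with the error mean square.
f_ratio = {term: ms / ms_error for term, ms in mean_square.items()}

for term, f in f_ratio.items():
    print(f"{term:<12} F({effect_df[term]}, {error_df}) = {f:.2f}")
```

A ratio near 1, as for the type and interaction effects, means the effect varies about as much as random error does, whereas the large ratio for cola brand is what produces its very small p value.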

These quantitative conclusions from the analysis of variance are in agreement
with the qualitative conclusions drawn from the boxplot: There is a
significant difference in foam volume between cola brands, but not between
cola types. Nor does there appear to be an interaction between cola brand
and type in how they influence foam volume.

Finally, how much of the total variation in foam volume has been explained
by the two-way ANOVA model? Recall that the coefficient of determination
(R² value) is equal to the fraction of the total sum of squares that is explained
by the sums of squares of the various factors. In this case that value is


R² = (1,880.00 + 183,750.50 + 4,903.38) / 264,106.46 = 0.721

Thus about 72% of the total variation in foam volume can be attributed to
differences in cola brand, cola type, and the interaction between cola brand and
type. Only about 28% of the total variation can be attributed to random causes.
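
The split between explained and unexplained variation can be verified with a few lines of Python. The sketch uses the four sums of squares reported earlier; the variable names are hypothetical.

```python
# Sums of squares as reported in the ANOVA table.
ss_type, ss_cola, ss_interaction, ss_error = 1880.00, 183750.50, 4903.38, 73572.58

ss_total = ss_type + ss_cola + ss_interaction + ss_error

# Fraction of the total variation explained by the model terms.
r_squared = (ss_type + ss_cola + ss_interaction) / ss_total

# Fraction left to random error.
unexplained = ss_error / ss_total

print(f"Total SS    = {ss_total:,.2f}")
print(f"R-squared   = {r_squared:.3f}")
print(f"Unexplained = {unexplained:.3f}")
```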


Summary 

To summarize the results from the plots and the analysis of variance, we
conclude the following:

1. There is no reason to reject the hypothesis that foam volume is the same
regardless of cola type (diet or regular).

2. There is a significant difference among the four cola brands (Coke, Pepsi,
Shasta, and generic) with respect to foam volume. Coke has the highest
volume of foam, Pepsi has the lowest, and the other two brands fall in
the middle.

3. There is no significant interaction between cola type and cola brand. In
other words, we don’t reject the null hypothesis that the difference in
foam volume between diet and regular is the same for all four brands.


You can save and close the Cola ANOVA workbook now.


Exercises 

1. Define the following terms:

a. Error sum of squares

b. Within-groups sum of squares

c. Between-groups sum of squares

d. Mean square error

2. Which value in the ANOVA table gives
an estimate of σ², the variance of the
error term in the means model?

3. If the between-groups mean square
error = 7,000 and the within-groups
mean square = 2,000, what is the value
of the F ratio? If the degrees of freedom
for the between-groups and within-groups
are 4 and 14, respectively, what
is the p value of the F ratio?

4. What is the Bonferroni correction factor
and when should you use it?

5. Use Excel to calculate the p value for the
following:

a. F = 2.5; numerator df = 10;
denominator df = 20

b. F = 3.0; numerator df = 10;
denominator df = 20

c. F = 3.5; numerator df = 10;
denominator df = 20

d. F = 4.0; numerator df = 10;
denominator df = 20

e. F = 4.5; numerator df = 10;
denominator df = 20

6. You’re performing a two-way ANOVA
on an education study to evaluate a new
teaching method. The two factors are
region (East, Midwest, South, or West)
and teaching method (standard or experimental).
Schools are entered into
the study, and their average test scores
are recorded. There are five replicates
for each combination of the region and
method factors.

a. Using the information about the design
of the study, complete the following
ANOVA table:


Term          SS       df   MS   F

Region        9,305    ?    ?    ?

Method        12,204   ?    ?    ?

Interaction   6,023    ?    ?    ?

Error         ?        ?    ?

Total         60,341   ?




b. What is the R² value of the ANOVA
model?

c. Use Excel’s FDIST function to calculate
the p values for each of the factors
and the interaction term in the model.

d. State your conclusions. What factors
have a significant impact on the test
scores? Is there an interaction between
region and teaching method?

7. In analyzing the hotel data there appeared
to be a problem of unequal population
variances. Does it help to use the
logarithm of price in place of price?

a. Open the Hotel workbook from the
Chapter10 data folder and save it as
Hotel Log ANOVA.

b. Compute a new variable LogPrice, the
natural log of price.

c. Repeat the one-way ANOVA using
LogPrice in place of Price (remember,
you will have to unstack the data to
use the Analysis ToolPak). Does there
now appear to be a problem of unequal
population variances?

d. Recalculate the matrix of paired differences
(use the Bonferroni correction
in calculating the p values).

e. Save your workbook and write a report
summarizing your results. Do your
conclusions differ in any important
way from what was obtained for Price?

8. The Hotel Two-Way workbook is taken
from the same source as the Hotel workbook,
except that the data are balanced
for a two-way ANOVA. This means that
the random sample was forced to have the
same number of hotels in each of 20 cells
of city and stars (four levels of city and
five levels of stars). For each of the 20 cells
specified by a level of city and a level of
stars, a random sample of two hotels was
taken. Therefore, the sample has 40 hotels.
Included in the file is a variable, city stars,
which indicates the combination of city
and stars. Perform the following analysis:

a. Open the Hotel Two-Way workbook
from the Chapter10 folder and save it
as Hotel Two-Way ANOVA.

b. Using Excel’s PivotTable feature, create
an interaction plot of the average hotel
price for the different combinations of
city and stars. Is there evidence of an
interaction apparent in the plot?

c. Do a two-way ANOVA for price versus
stars and city. (You will have to create
a two-way table that has stars as the
row variable and city as the column
variable.) Is there a significant interaction?
Are the main effects significant?

d. On the basis of the means for the five
levels of stars, give an approximate
figure for the additional cost per star.

e. Compare the city effect in this model
to the one-way analysis, which did
not take into account the rating for
each hotel.

f. As the number of stars increases, the
mean price increases approximately
linearly. Graph price versus stars.
Break down the chart into categories
on the basis of the city variable and
then add trend lines to each of the four
cities. Include the four regression equations
on the chart. Do the slopes appear
to be the same for the different cities?

g. Save your changes to the workbook
and write a report summarizing your
observations.

9. Continue to explore the data from the 
Cola workbook discussed in this chapter 
by performing the following analysis: 

a. Open the Cola workbook from the 
Chapter 10 folder and save it as Cola 
Oneway ANOVA. 

b. Create boxplots and multiple histograms
of the foam variable for the different
cola brands.

c. Because the two-way ANOVA performed
in the chapter showed that the
interaction term and the type effect
were not significant, redo your analysis
as a one-way ANOVA with cola as
the single factor.

d. Create a matrix of paired differences, 
using the Bonferroni correction. 
Which pairs of colas are different in 
terms of their foam volume? 

e. Save your changes to the workbook and 
write a report summarizing your results. 

10. You’ve been given a workbook that contains
information on 32 colleges from
the 2008 edition of U.S. News and World
Report’s “America’s Best Colleges,”
which lists 248 national liberal arts colleges
in four tiers. The first two tiers are
combined in a list of 125 colleges, and
there are 61 in tier three and 62 in tier
four. Splitting the 125 into 62 and 63,
you have four tiers with 62, 63, 61, and
62 colleges, respectively. A random sample
of eight was drawn from each of the
four tiers, excluding nonprivate colleges.
The data set includes Tier (from 1 to 4),
College, Expenses (including tuition and
fees but not room and board), and InState
(the percentage of students who come
from within the state). Perform the following
analysis on the data set:

a. Open the Four Year workbook from
the Chapter10 folder and save it as
Four Year ANOVA.

b. Create a multiple histogram of the
tuition for different tier levels. Are there
apparent problems with the normality
and constant variance assumptions?

c. Perform a one-way ANOVA to compare
expenses in the four tiers. Does
the tier affect the cost of attending a
private college?

d. Create a matrix of paired mean differences.
Does it cost significantly more
to attend a college in a more prestigious
tier?

e. Notice that the means for expenses
decrease roughly linearly as the tier
number increases. Accordingly, regress
Expenses on Tier. Interpret the
tier regression coefficient in terms of
the drop in expenses when you move
to a higher tier number. Conversely,
how much more does it cost to attend
a college in a more prestigious tier
(with a lower tier number)?

f. Save your changes to the workbook
and write a report summarizing your
results, stating whether it is more
expensive to attend a highly rated college
and, if so, how the cost is related
to the rating. Compare the regression
and the ANOVA.

11. The Four Year workbook of Exercise 10
includes InState, the percentage of students
coming from within the state. How
does the InState variable depend on tier?

a. Open the Four Year workbook from
the Chapter10 data folder and save it
as Four Year Instate ANOVA.

b. Create a boxplot of InState broken
down by tier. Notice the outlier in
tier 4. Which college is it, and how
can you explain it? As a hint, consider
that the college is in Vermont,
which has a small population. Why is
that relevant here?

c. Perform a one-way analysis of variance
to compare tiers. Are there significant
differences among tiers in the
percentage of instate students?

d. Create a matrix of paired mean differences.
Does the first tier have significantly
fewer instate students in
comparison to the other three tiers?

e. Redo the analysis in parts b and c but
this time do not include the outlier.
How does this affect the results?

f. Save your changes to the workbook and
write a report summarizing your results.

12. The Infield workbook data set has statistics
on 120 major league baseball infielders
at the start of the 2007 season. The
data include Salary, logSalary (the logarithm
of salary), and Position.

a. Open the Infield workbook and save
it as Infield Salary ANOVA.

b. Create multiple histograms and boxplots
to see the distribution of Salary
for each position. How would you describe
the shape of the distribution?

c. Make the same plots for logSalary.
How does the shape of the distribution
change with the logarithm of the salary?

d. Perform a one-way ANOVA of logSalary
on Position to see whether there
is any significant difference of salary
among positions.

e. Save your changes to the workbook and
write a report summarizing your results.

13. The Infield workbook also contains the
SLG variable, the slugging percentage of
each infield player. Analyze the relationship
between slugging percentage and
position.

a. Open the Infield workbook from the
Chapter10 data folder and save it as
Infield SLG ANOVA.

b. Create multiple histograms and boxplots
of the SLG variable against
Position. Describe the shape of the
distributions. Is there any reason to
doubt the validity of the ANOVA
assumptions?

c. Perform a one-way ANOVA of SLG
against Position.

d. Create a matrix of paired mean differences
to compare infield positions
(use the Bonferroni correction factor).
Which positions differ significantly?
Can you explain why?

e. Save your changes to the workbook
and write a report summarizing your
results.

14. The Honda25 workbook contains the
prices of used Hondas and indicates the
age (in years) and whether the transmission
is 5 speed or automatic.

a. Open the Honda25 workbook from
the Chapter10 data folder and save it
as Honda25 ANOVA.

b. Perform a two-sample t test for the
price data on the basis of the transmission
type.

c. Perform a one-way ANOVA with
price as the dependent variable and
transmission as the grouping variable.

d. Compare the value of the t statistic
in the t test to the value of the F ratio
in the F test. Do you find that the F
ratios for ANOVA are the same as the
squares of the t values from the t test
and that the p values are the same?

e. Use one-way ANOVA to compare the
ages of the Hondas for the two types of
transmissions. Does this explain why
the difference in price is so large?

f. Perform two regressions of price vs.
age: the first for automatic transmissions
and the second for 5-speed
transmissions. Compare the two linear
regression lines. Do they appear to
be the same? What problems do you
see with this approach?

g. Save your changes to the workbook
and write a report summarizing your
results.

15. The Honda12 workbook contains a subset
of the Honda workbook in which the
age variable is made categorical and has
the values 1-3, 4-5, and 6 or more. Some
observations of the workbook have been
removed to balance the data. The variable
Trans indicates the transmission,
and the variable Trans Age indicates
the combination of transmission and
age class.

a. Open the Honda12 workbook from
the Chapter10 folder and save it as
Honda12 ANOVA.

b. Create a multiple histogram and boxplot
of Price versus Trans Age. Does
the constant variance assumption for
a two-way analysis of variance appear
justified?

c. Create an interaction plot of Price
versus Trans and Trans Age (you will
need to create a pivot table of means
for this). Does the plot give evidence
for an interaction between the Trans
and Trans Age factors?

d. Perform a two-way analysis of variance
of Price on Trans Age and Trans
(you will have to create a two-way
table using Trans Age as the row variable
and Trans as the column variable).

e. Save your changes to the workbook
and write a report summarizing your
observations.

16. At the Olympics, competitors in the 100-meter
dash go through several rounds of
races, called heats, before reaching the
finals. The first round of heats involves
over a hundred runners from countries
all over the globe. The heats are evenly
divided among the premier runners so
that one particular heat does not have an
overabundance of top racers. You decide
to test this assumption by analyzing data
from the 1996 Summer Olympics in
Atlanta, Georgia.

a. Open the Race workbook from the
Chapter10 folder and save it as Race
Times ANOVA.

b. Create a boxplot of the race times broken
down by heats. Note any large outliers
in the plot and then rescale the plot
to show times from 9 to 13 seconds. Is
there any reason not to believe, based
on the boxplot, that the variation of race
times is consistent between heats?

c. Perform a one-way ANOVA to test
whether the mean race times among
the 12 heats are significantly different.

d. Create a pairwise means matrix of the
race times by heat.

e. Save your workbook and summarize
your conclusions. Are the race times
different between the heats? What is
the significance level of the analysis
of variance?

17. Repeat Exercise 16, this time looking at
the reaction times among the 12 heats and
deciding whether these reaction times
vary. Write your conclusions and save
your workbook as Race Reaction ANOVA.

18. Another question of interest to race
observers is whether reaction times
increase as the level of competition increases.
Try to answer this question by
analyzing the reaction times for the
14 athletes who competed in the first three
rounds of heats of the men’s 100-meter
dash at the 1996 Summer Olympics.

a. Open the Race Rounds workbook
from the Chapter10 data folder and
save it as Race Rounds ANOVA.

b. Use the Analysis ToolPak’s ANOVA:
Two-Factor Without Replication command
to perform a two-way analysis
of variance on the data in the Reaction
Times worksheet. What are the two
factors in the ANOVA table?

c. Examine the ANOVA table. What
factors are significant in the analysis
of variance? What percentage of the
total variance in reaction time can be
explained by the two factors? What is
the R² value?

d. Examine the means and standard deviations
of the reaction times for each
of the three heats. Using these values,
form a hypothesis for how you think
reaction times vary with rounds.

e. Test your hypothesis by performing
a paired t test on the difference in
reaction times between each pair of
rounds (1 vs. 2, 2 vs. 3, and 1 vs. 3).
Which pairs show significant differences
at the 5% level? Does this
confirm your hypothesis from the previous
step?

f. Because there is no replication of a racer’s
reaction time within a round, you
cannot add an interaction term to the
analysis of variance. You can still create
an interaction plot, however. Create
an interaction plot with round on the
x-axis and the reaction time for each
racer as a separate line in the chart.
On the appearance of the chart, do
you believe that there is an interaction
between round and the racer involved?
What impact does this have on your
overall conclusions as to whether
reaction time varies with round?

g. Save your changes to the workbook
and report your observations.

19. Researchers are examining the effect of
exercise on heart rate. They’ve asked
volunteers to exercise by going up and
down a set of stairs. The experiment has
two factors: step height and rate of stepping.
The step heights are 5.75 inches
(coded as 0) and 11.5 inches (coded as 1).
The stepping rates are 14 steps/min
(coded as 0), 21 steps/min (coded as 1),
and 28 steps/min (coded as 2). The experimenters
recorded both the
resting heart rate (before the exercise)
and the heart rate afterward. Analyze
their findings.

a. Open the Heart workbook from the
Chapter10 data folder and save it as
Heart ANOVA.

b. Create a two-way table using StatPlus.
Place frequency in the row area of the
table, place height in the column area
of the table, and use heart rate after
the exercise as the response variable.

c. Analyze the values in the two-way
table with a two-way ANOVA (with
replication). Is there a significant interaction
between the frequency at
which subjects climb the stairs and
the height of the stairs as it affects the
subject’s heart rate?

d. Create an interaction plot. Discuss
why the interaction plot supports
your findings from the previous step.

e. Create a new variable named Change,
which is the change in heart rate due
to the exercise. Repeat parts a-c for
this new variable and answer the
question of whether there is an interaction
between frequency and height
in affecting the change in heart rate.

f. Save your changes to the workbook
and write a report summarizing your
conclusions.

20. The Noise workbook contains data from
a statement by Texaco, Inc. to the Air and
Water Pollution Subcommittee of the
Senate Public Works Committee on June
26, 1973. Mr. John McKinley, president
of Texaco, cited an automobile filter developed
by Associated Octel Company as
effective in reducing pollution. However,
questions had been raised about the effects
of filters on vehicle performance,
fuel consumption, exhaust gas back pressure,
and silencing. On the last question,
he referred to the data included here
as evidence that the silencing properties
of the Octel filter were at least equal to
those of standard silencers.

a. Open the Noise workbook from the
Chapter10 data folder and save it as
Noise ANOVA.

b. Create boxplots and histograms of the
Noise variable, broken down by the
Size_Type variable (you should edit
the labels in the boxplot to make the
plot easier to read).

c. Create an interaction plot of the Noise
variable for different levels of the Size
and Type factors. Is there evidence of
an interaction from the plot?

d. Create a two-way table of the Noise
data for the Size and Type factors.

e. Using the two-way table, perform a
two-way ANOVA on the data. What
factors are significant?

f. Save your changes to the workbook
and write a report summarizing your
conclusions.

21. The Waste workbook contains data from
a clothing manufacturer. The firm’s quality
control department collects weekly
data on percent waste, relative to what
can be achieved by computer layouts of
patterns on cloth. A negative value indicates
that the plant employees beat the
computer in controlling waste. Your job
is to determine whether there is a significant
difference among the five plants
in their percent waste values.

a. Open the Waste workbook from the
Chapter10 folder and save it as Waste
ANOVA.

b. Create boxplots of the waste value for
the five plants. Are there any extreme
outliers in the data that you should be
concerned about?

c. Perform a one-way analysis of variance
on the data.

d. Create a matrix of paired mean differences
for the data. State your tentative
conclusions.

e. Copy the waste data to another worksheet
in the workbook, and delete any
observations that were identified as
extreme outliers on the boxplots.

f. Redo your one-way ANOVA and
means matrix on the revised data.
Have your conclusions changed?

g. Save your workbook and write a report
summarizing your findings.

22. Cuckoos lay their eggs in the nests of
other host birds. The host birds adopt
and then later hatch the eggs. The Eggs
workbook contains data on the lengths
of eggs found in the nest of host birds.
One theory holds that cuckoos lay their
eggs in the nests of a particular host
species and that they mate within a defined
territory. If true, this would cause
a geographical subspecies of cuckoos
to develop and natural selection would
ensure the survival of cuckoos most fitted
to lay eggs that would be adopted by
a particular host. If cuckoo eggs differed
in length between hosts, this would lend
some weight to that hypothesis. You’ve
been asked to compare the length of the
eggs placed in the different nests of the
host birds.

a. Open the Eggs workbook from the
Chapter10 folder and save it as Eggs
ANOVA.

b. Perform a one-way ANOVA on the
egg lengths for the six species. Is there
evidence that the egg lengths differ
between the species?

c. Create a boxplot of the egg lengths.

d. Analyze the pairwise differences between
the species by creating a means
matrix. Use the Bonferroni correction
on the p values. What, if any, differences
do you see between the species?

e. Save your changes to the workbook
and write a report summarizing your
observations.


Chapter 11


Time Series


Objectives

In this chapter you will learn to:

• Plot a time series

• Compare a time series to lagged values of the series

• Use the autocorrelation function to determine the relationship
between past and current values

• Use moving averages to smooth out variability

• Use simple exponential smoothing and two-parameter exponential
smoothing

• Recognize seasonality and adjust data for seasonal effects

• Use three-parameter exponential smoothing to forecast future values
of a time series

• Optimize the simple exponential smoothing constant


Time Series Concepts

A time series is a sequence of observations taken at evenly spaced time intervals.
The sequence could be daily temperature measurements, weekly
sales figures, monthly stock market prices, quarterly profits, or yearly
power-consumption data. Time series analysis involves looking for patterns that
help us understand what is happening with the data and help us predict
future observations. For some time series data (for example, monthly sales
figures), you can identify patterns that change with the seasons. This seasonal
behavior is important in forecasting.

Usually the best way to start analyzing a time series is by plotting the data
against time to show trends, seasonal patterns, and outliers. If the variability
of the series changes with time, the series might benefit from a transformation
that stabilizes the variance. Constant variance is assumed in much of time series
analysis, just as in regression and analysis of variance, so it pays to see
first whether a transformation is needed. The logarithmic transformation is
one such example that is especially useful for economic data. For example,
if there is growth in power consumption over the years, then the
month-to-month variation might also increase proportionally. In this case, it might be
useful to analyze either the log or the percentage change, which should have
a variance that changes little over time.
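
The effect of the log transformation can be seen in a small Python sketch. The series below is hypothetical: it grows 5% per period and carries a proportional wobble, loosely in the spirit of the power consumption example. The variable and function names are invented for illustration.

```python
import math
import statistics

# Hypothetical series: 5% growth per period with a multiplicative wobble,
# so the size of the fluctuations grows along with the level of the series.
wobble = [1.03, 0.98, 1.01, 0.97, 1.02, 0.99]
series = [100.0 * (1.05 ** t) * wobble[t % len(wobble)] for t in range(48)]

def changes(values):
    """Period-to-period differences of a sequence."""
    return [later - earlier for earlier, later in zip(values, values[1:])]

raw_changes = changes(series)
log_changes = changes([math.log(v) for v in series])

# Compare the spread of the changes in the second half of the series with
# the spread in the first half, before and after taking logs.
half = len(raw_changes) // 2
raw_ratio = statistics.stdev(raw_changes[half:]) / statistics.stdev(raw_changes[:half])
log_ratio = statistics.stdev(log_changes[half:]) / statistics.stdev(log_changes[:half])

print(f"Spread ratio, raw changes: {raw_ratio:.2f}")
print(f"Spread ratio, log changes: {log_ratio:.2f}")
```

For this made-up series the raw changes spread out considerably as the level rises, whereas the changes in the logs keep roughly the same spread throughout, which is the sense in which the transformation stabilizes the variance.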


Time Series Example: The Rise in Global Temperatures

To illustrate these ideas, you’ve been provided the Global Temperature workbook
(Source: http://data.giss.nasa.gov/gistemp/tabledata/GLB.Ts.txt). The
workbook contains average annual temperature readings compiled by NASA,
covering the years 1880 through 1997. The NASA data are often used by
climatologists investigating climate change and global warming. Table 11-1
describes the range names and data contained in the workbook.


Table 11-1 Global Temperature Workbook

Range Name    Range      Description
Year          A2:A129    The year
Decade        B2:B129    The decade
Celsius       C2:C129    The average annual global temperature in degrees Celsius
Fahrenheit    D2:D129    The average annual global temperature in degrees Fahrenheit

To open the Global Temperature workbook:

1 Open the Global Temperature workbook from the Chapter11 data
folder.

2 Save the workbook as Global Temperature Analysis. The workbook
appears as shown in Figure 11-1.


Figure 11-1 The Global Temperature workbook

Plotting the Global Temperature Time Series

Before doing any computations, it is best to explore the time series
graphically. You’ll plot the average annual temperatures in degrees Fahrenheit.

To plot the annual average temperature readings: 

1 Select the nonadjacent range A1:A129;D1:D129.

2 Click the Scatter button from the Charts group on the Insert tab.

3 Click the third scatter chart subtype (Scatter with Smooth lines).

4 Scroll up to the top of the window, and with the chart still selected,
click the Move Chart button from the Location group on the Design
tab of the Chart Tools ribbon. Move the chart to a new chart sheet
named Temperature Chart.

5 Remove the legend and gridlines from the chart.

6 Change the chart title to Average Global Temperature. Change the
x axis title to Year and the y axis title to Temperature (F). Figure 11-2
shows the edited chart.


Figure 11-2 Time plot of annual global temperatures

The plot in Figure 11-2 shows a noticeable increase in mean annual
temperature over the second half of the twentieth century. While the trend is
striking, there is still a great deal of variability in the average annual
temperature from year to year. We can smooth out this variability by plotting
the averages per decade.



To calculate the average global temperature per decade: 

1 Click the Temperature sheet tab.

2 Click Descriptive Statistics from the StatPlus menu and then click
Univariate Statistics.

3 Click the Summary tab and select the Count and Average checkboxes.

4 Click the Variability tab and select the Range and Std. Deviation
checkboxes.

5 Click the General tab and click the Columns option button to display
the statistics by columns rather than by rows.

6 Click the Input button and select Fahrenheit from the list of range 
names. Click OK. 

Now you’ll break down the statistics by decade. 

7 Click the By button and select Decade from the range names list. 
Click OK. 

8 Click the Output button and direct the output to a new worksheet 
named Temps by Decade. 

9 Click OK. 


Excel generates the table shown in Figure 11-3. 


Figure 11-3 Temperature statistics by decade
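
The same per-decade breakdown can be sketched in Python. This is a minimal illustration, not a description of how StatPlus works internally: the readings are hypothetical values (not the NASA figures), the decade is derived from the year rather than read from a Decade column, and the standard deviation is assumed to be the sample standard deviation.

```python
import statistics
from collections import defaultdict

# Hypothetical (year, temperature in degrees F) pairs, NOT the NASA values.
readings = [
    (1880, 56.7), (1881, 56.9), (1889, 57.0),
    (1890, 56.6), (1895, 56.8), (1899, 57.1),
]


def stats_by_decade(rows):
    """Count, average, range, and sample std. deviation for each decade."""
    groups = defaultdict(list)
    for year, temp in rows:
        groups[year // 10 * 10].append(temp)  # 1889 -> 1880, 1890 -> 1890
    table = {}
    for decade in sorted(groups):
        temps = groups[decade]
        table[decade] = {
            "count": len(temps),
            "average": statistics.mean(temps),
            "range": max(temps) - min(temps),
            "std_dev": statistics.stdev(temps),  # needs at least 2 values
        }
    return table


decade_table = stats_by_decade(readings)
for decade, row in decade_table.items():
    print(decade, row["count"], round(row["average"], 3),
          round(row["range"], 3), round(row["std_dev"], 3))
```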


Now that you’ve calculated the per-decade averages, create a line chart
of the values.

To create a line chart of the decade averages:

1 Select the nonadjacent cell range B3:B15;D3:D15 from the table of 
decade statistics. 

2 Click the Line button from the Charts group on the Insert tab and 
click the first chart subtype (Line). 

3 Move the line chart to a new chart sheet named Decades Chart.

4 Remove the legend from the chart.

5 Add the chart title Average Global Temperature. Set the x axis title
to Decade and the y axis title to Temperature (F).

6 Change the number format of the values on the y axis to one decimal
place. Figure 11-4 shows the formatted line chart.


Figure 11-4 Average temperature by decade

The chart clearly shows a minor dip in the decade average from 1940 to
1960, after which there is a steady increase in global temperature up through
the decade of the 2000s.

Analyzing the Change in Global Temperature 

The changes in the temperature average from year to year are important.
Your next step in examining these values is to analyze annual average
temperature change.

To calculate the change in the annual average temperature: 

1 Click the Temperature sheet tab to return to the data.

2 Click cell E1, type Change, and then press Enter.

3 Select the range E3:E129 (not E2:E129).

4 Type =D3-D2 in cell E3; then press Enter.

5 Click the Fill button in the Editing group on the Home tab and
then click Down. Excel fills the difference formula down the
remaining cells in the column, displaying the change in mean annual
temperature from one year to the next.
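
The fill-down of =D3-D2 amounts to subtracting each value from the one that follows it. A minimal Python sketch of the same calculation, using hypothetical temperatures rather than the workbook data:

```python
def yearly_change(values):
    """Change from each observation to the next, like =D3-D2 filled down.

    The result has one fewer entry than the input because the first
    year has no previous year to compare with.
    """
    return [current - previous for previous, current in zip(values, values[1:])]


# Hypothetical annual averages in degrees F (not the NASA values).
temps = [56.5, 56.75, 56.25, 57.0]
print(yearly_change(temps))  # [0.25, -0.5, 0.75]
```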


Now that you have calculated the differences in the mean annual
temperature from one year to the next, you can plot those differences.

To plot the change in the temperature versus time: 

1 Select the range A1:A129, press and hold the CTRL key, and then
select the range E1:E129.

2 Click the Scatter button from the Charts group on the Insert tab and 
click the first chart subtype (Scatter with only markers). 

3 Move the chart to a new chart sheet named Yearly Change. 

4 Remove the legend and gridlines from the plot. 

5 Enter the chart title Yearly Temperature Change. Set the title of the
x axis to Year and the title of the y axis to Change in Temperature (F).
Figure 11-5 shows the formatted scatter chart.


Figure 11-5 Yearly differences in mean global temperature

There does not appear to be any trend in the change in mean annual
temperature over the 128 years represented in the data set. The changes in
temperature values appear as scattered in recent years as they do in the years
at the beginning of the chart.


Looking at Lagged Values 

Often in time series you will want to compare a value observed at one
time point to the value observed one or more time points earlier. In the
temperature data, for example, you might be interested in whether the mean
temperature from one year can be used to predict the mean temperature
of the following year. Such prior values are known as lagged values. Lagged
values are an important concept in time series analysis. You can lag
observations for one or more time points. In the example of the global
temperature data, the lag 1 value is the temperature value one year prior, the
lag 2 value is the temperature value two years prior, and so forth.

You can calculate lagged values by letting the values in rows of the lagged
column be equal to values one or more rows above in the unlagged column.
Let’s add a new column to the Temperature worksheet, consisting of annual
temperature averages lagged one year.

To create a column of lag 1 values for the global temperature data: 

1 Click the Temperature sheet tab to return to the data. 

2 Right-click the D column header so that the entire column is selected 
and the pop-up menu opens. Click Insert in the pop-up menu. 

3 Click cell D1, type Lag1 Temps (F), and press Enter.

4 Select the range D3:D129 (not D2:D129).

5 Type =E2 in cell D3 (this is the value from the previous year); then
press Enter.

6 Click the Fill button from the Editing group on the Home tab and
click Down. Excel fills in the rest of the column with the one-year
lagged values.


Each row of the lagged temperature values is equal to the temperature value
of the previous year. You could have created a column of lag 2 values by
selecting the range D4:D129 and letting D4 be equal to E2, and so on. Note that
for the lag 2 values you have to start two rows down, as compared to one row
down for the lag 1 values. The lag 3 values would have been put into the range
D5:D129.
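
The row-shifting idea can be sketched in Python as follows. The function name `lagged` is an illustrative choice, and the short series is a small made-up example rather than the temperature data; blank cells at the top of a lagged column are represented by None.

```python
def lagged(values, k=1):
    """Lag-k column: the first k rows are blank, the rest are shifted down k rows."""
    if k < 1 or k >= len(values):
        raise ValueError("k must be between 1 and len(values) - 1")
    return [None] * k + list(values[:len(values) - k])


series = [6, 4, 8, 5, 0, 7]  # small illustrative series
print(lagged(series, 1))  # [None, 6, 4, 8, 5, 0]
print(lagged(series, 2))  # [None, None, 6, 4, 8, 5]
```

Each lagged column keeps the length of the original series, so the pairs (value, lag value) line up row by row, just as the D and E columns do in the worksheet.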

How do the temperature values compare to those of the previous year? To
see the relationship between each temperature and its one-year lag value,
create a scatterplot.

To create a scatterplot of temperature versus one-year lagged
temperatures:

1 Select the range D1:E129.

2 Click the Scatter button from the Charts group on the Insert tab and then
select the first chart subtype (Scatter).

3 Move the chart to a new chart sheet named Lag 1 Chart.

4 Remove the gridlines and legend from the plot. Name the chart title
Lagged Temperatures, the x axis title Prior Year Temperature (F),
and the y axis title Temperature (F). See Figure 11-6.


Figure 11-6 Temperature and Lag1 temperature values

As shown in the chart, there is a strong positive relationship between the
temperature value in one year and the temperature value from the previous year.
This means that a high temperature value in one year tends to be followed by a
high (or above average) value in the following year; a low value in one year
tends to be followed by a low (or below average) value in the next year. In time
series analysis, we study the correlations among observations, and these
relationships are sometimes helpful in predicting future observations. In this
example, the annual temperature value appears to be strongly correlated with
the temperature value from the previous year. Temperatures might also be
correlated with observations two, three, or more years earlier. To discover the
relationship between a time series and other lagged values of the series,
statisticians calculate the autocorrelation function.


The Autocorrelation Function 

If there is some pattern in how the values of your time series change from
observation to observation, you could use it to your advantage. Perhaps a
below-average value in one year makes it more likely that the series will be
high in the next year, or maybe the opposite is true: a year in which the series
is low makes it more likely that the series will continue to stay low for a while.

The autocorrelation function (ACF) is useful in finding such patterns. It
is similar to a correlation of a data series with its lagged values. The ACF
value for lag 1 (denoted by $r_1$) calculates the relationship between the data
series and its lagged values. The formula for $r_1$ is

$$r_1 = \frac{(y_2 - \bar{y})(y_1 - \bar{y}) + (y_3 - \bar{y})(y_2 - \bar{y}) + \cdots + (y_n - \bar{y})(y_{n-1} - \bar{y})}{(y_1 - \bar{y})^2 + (y_2 - \bar{y})^2 + \cdots + (y_n - \bar{y})^2}$$

Here, $y_1$ represents the first observation, $y_2$ the second observation, and
so forth. Finally, $y_n$ represents the last observation in the data set, and
$\bar{y}$ is the average of all the observations. Similarly, the formula for
$r_2$, the ACF value for lag 2, is

$$r_2 = \frac{(y_3 - \bar{y})(y_1 - \bar{y}) + (y_4 - \bar{y})(y_2 - \bar{y}) + \cdots + (y_n - \bar{y})(y_{n-2} - \bar{y})}{(y_1 - \bar{y})^2 + (y_2 - \bar{y})^2 + \cdots + (y_n - \bar{y})^2}$$

The general formula for calculating the autocorrelation for lag k is

$$r_k = \frac{(y_{k+1} - \bar{y})(y_1 - \bar{y}) + (y_{k+2} - \bar{y})(y_2 - \bar{y}) + \cdots + (y_n - \bar{y})(y_{n-k} - \bar{y})}{(y_1 - \bar{y})^2 + (y_2 - \bar{y})^2 + \cdots + (y_n - \bar{y})^2}$$

Before considering the autocorrelation of the temperature data, let’s apply
these formulas to a smaller data set, as shown in Table 11-2.


Table 11-2 Sample Autocorrelation Data

Observation    Values    Lag 1 Values    Lag 2 Values
1              6
2              4         6
3              8         4               6
4              5         8               4
5              0         5               8
6              7         0               5

The average of the values is 5, $y_1$ is 6, $y_2$ is 4, $y_3$ is 8, and so
forth, through $y_n$, which is equal to 7. To find the lag 1 autocorrelation,
use the formula for $r_1$ so that

$$r_1 = \frac{(4 - 5)(6 - 5) + (8 - 5)(4 - 5) + \cdots + (7 - 5)(0 - 5)}{(6 - 5)^2 + (4 - 5)^2 + \cdots + (7 - 5)^2} = \frac{-14}{40} = -0.35$$

In the same way, the value for $r_2$, the lag 2 ACF value, is

$$r_2 = \frac{(8 - 5)(6 - 5) + (5 - 5)(4 - 5) + \cdots + (7 - 5)(5 - 5)}{(6 - 5)^2 + (4 - 5)^2 + \cdots + (7 - 5)^2} = \frac{-12}{40} = -0.30$$

The values for $r_1$ and $r_2$ imply a negative correlation between the current
observation and its lag 1 and lag 2 values (that is, the previous two
values). So a low value at one time point indicates high values for the next two
time points. Now that you’ve seen how to compute $r_1$ and $r_2$, you should be
able to compute $r_3$, the lag 3 autocorrelation. Your answer should be 0.275,
a positive correlation, indicating that values of this series are positively
correlated with observations three time points earlier.
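
The hand calculation can be checked with a short Python sketch of the general lag k formula. The function name `acf` is chosen here only to echo the text; it is an illustration and is not the StatPlus ACF function.

```python
def acf(values, lag):
    """Lag-k autocorrelation r_k, following the general formula in the text."""
    n = len(values)
    mean = sum(values) / n
    # Products of deviations for pairs that are `lag` observations apart
    numerator = sum(
        (values[t] - mean) * (values[t - lag] - mean) for t in range(lag, n)
    )
    # Sum of squared deviations over the whole series
    denominator = sum((v - mean) ** 2 for v in values)
    return numerator / denominator


sample = [6, 4, 8, 5, 0, 7]  # the Values column of Table 11-2
r1, r2, r3 = (acf(sample, k) for k in (1, 2, 3))
print(round(r1, 3), round(r2, 3), round(r3, 3))  # -0.35 -0.3 0.275
```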

Recall from earlier chapters that a constant variance is needed for
statistical inference in simple regression and also for correlation. The same
holds true for the autocorrelation function. The ACF can be misleading for a
series with unstable variance, so it might first be necessary to transform for a
constant variance before using the ACF.


Applying the ACF to Annual Mean Temperature 

Now apply the ACF to the temperature data. You can use StatPlus to
compute and plot the autocorrelation values for you.

To compute the autocorrelation function for the annual mean 
temperatures: 

1 Click the Temperature sheet tab. 

2 Click Time Series from the StatPlus menu and then click ACF Plot. 

3 Click the Data Values button and select Fahrenheit from the range 
names list. Click OK. 

4 Enter 20 in the Calculate ACF up through lag spin box to calculate
the autocorrelations between the mean annual temperature values
and mean temperatures up to 20 years earlier.

5 Click the Output button, and send the output to a new sheet named
Temperature ACF. Click OK twice to close the dialog box and
calculate the ACF. Figure 11-7 shows the output from the ACF
command.


Figure 11-7 Autocorrelation of the temperature data (callouts: significant
autocorrelations; upper and lower 95% confidence values; 95% confidence
interval)

The output shown in Figure 11-7 lists the lags from 1 to 20 and gives the
corresponding autocorrelations in the next column.

The lower and upper ranges of the autocorrelations are shown in the next
two columns and indicate how low or high the correlation needs to be for
statistical significance at the 5% level. Autocorrelations that lie outside
this range are shown in red in the worksheet. The plot of the ACF values
and confidence widths gives a visual picture of the patterns in the data.
The two curves indicate the width of the 95% confidence interval of the
autocorrelations.

The autocorrelations are very high for the lower lag numbers, and they
remain significant (that is, they lie outside the 95% confidence width
boundaries) through lag 9. Specifically, the correlation between the mean
annual temperature and the mean annual temperature of the previous year
is 0.829 (cell B2). The correlation between the current temperature and the
lag 2 value is 0.738 (cell B3), and so forth. This is typical for a series that
has a strong trend upward or downward. Given the increase in the global
temperatures during the latter half of the twentieth century, it shouldn’t be
surprising that high temperatures are correlated with the high temperatures
of the previous year. In such a series, if an observation is above the mean,
then its neighboring observations are also likely to be above the mean,
and the autocorrelations with nearby observations are high. In fact, when
there is a trend, the autocorrelations tend to remain high even for high lag
numbers.

STATPLUS TIPS

• You can use StatPlus’s ACF(range, lag) function to compute 
autocorrelations for specific lag values. Here, range is the range 
of cells containing the time series data, and lag is the number 
of observations to lag. Note that values must be placed within a 
single column. 


Other ACF Patterns 

Other time series show different types of autocorrelation patterns. Figure 11-8 
shows four examples of time series (trend, cyclical, oscillating, and random), 
along with their associated autocorrelation functions. 


Figure 11-8 Four sample time series with corresponding ACF patterns



You have already seen the first example with the temperature data. The 
trend need not be increasing; a decreasing trend also produces the type of 
ACF pattern shown in the first example in Figure 11-8. 

The seasonal or cyclical pattern shown in the second example is common
in weather data that follows a seasonal pattern (such as monthly average
temperature). The length of the cycle in this example is 12, indicated in the
ACF by the large positive autocorrelation for lag 12. Because the data in the
time series follow a cycle of length 12, you would expect that values 12 units
apart would be highly correlated with each other. Seasonal time series
models are covered more thoroughly later in this chapter.

The third example shows an oscillating time series. In this case, a large
value is followed by a low value and then by another large value. An
example of this might be winter and summer sales over the years for a large
retail toy company. Winter sales might always be above average because of
the holiday season, whereas summer sales might always be below average.
This pattern of oscillating sales might continue and could follow the pattern
shown in Figure 11-8. The ACF for this time series has an alternating
pattern of positive and negative autocorrelations.

Finally, if the observations in the time series are independent or nearly
independent, there is no discernible pattern in the ACF and all the
autocorrelations should be small, as shown in the fourth example. This is
characteristic of a purely random series (often called white noise), in which
current values are independent of previous values and thus you cannot use
current values to predict future ones.

There are many other possible patterns of behavior for time series data
besides the four examples shown here.


Applying the ACF to the Change in Average 
Global Temperature 

Having looked at the autocorrelation function for the mean annual
temperature, let’s look at the ACF for the change in the average global
temperature. Does an increase in temperature in one year imply that the next
year will also show an increase? Or is the opposite more likely, where years
that show a large increase in temperature are followed by years in which the
temperature increase is smaller or is even a decrease? Let’s find out.

To calculate the autocorrelation for the change in annual 
temperature: 

1 Click the Temperature sheet tab.

2 Click Time Series from the StatPlus menu and then click ACF Plot.

3 Click the Data Values button, click the Use Range References option
button, and select the range F3:F129.

4 Deselect the Range includes a row of column labels checkbox and
click OK.

You want to deselect this checkbox because this selection does not
include a header row.

5 Enter 20 in the Calculate ACF up through lag spin box.

6 Click the Output button, and send the output to a new sheet named
Change ACF. Click OK twice.

Figure 11-9 shows the output from the command.



The autocorrelations for change in average temperature are not as strong
as you saw earlier using the yearly temperature values. However, note that
the lag 1 and lag 2 correlations are both statistically significant and negative.
This indicates a negative correlation between the current change in
temperature and changes from one or two years prior. Apparently an increase
in temperature in one year is associated with a smaller increase or even a
decrease in the next two years.


Moving Averages 

As you saw earlier in Figure 11-5, the change in average temperature can
vary unpredictably from one year to another. One way of smoothing out this
fluctuation is to take the average change over an entire decade as you did
for the yearly temperature values. Another way of smoothing your data is to
calculate a moving average.

For example, if you calculate the average change for each of the last five
years and you do this every year, you are forming a moving average of those
values as you move forward in time. Specifically, to calculate the five-year
moving average for values prior to the observation $y_n$, you define the moving
average $y_{ma(5)}$ such that

$$y_{ma(5)} = \frac{y_{n-1} + y_{n-2} + y_{n-3} + y_{n-4} + y_{n-5}}{5}$$

The number of observations used in the moving average is called the
period. Here the period is 5.
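
A minimal Python sketch of this definition is shown below. It follows the formula above, averaging the `period` values that come before each observation; a charting tool may position its averaging window differently, so treat the alignment as an assumption of this sketch.

```python
def moving_average(values, period=5):
    """Average of the `period` observations before each point.

    Entries are None until `period` prior values are available.
    """
    result = []
    for n in range(len(values)):
        if n < period:
            result.append(None)
        else:
            window = values[n - period:n]  # y_{n-1} back to y_{n-period}
            result.append(sum(window) / period)
    return result


# Hypothetical series for illustration
print(moving_average([1, 2, 3, 4, 5, 6], period=5))
# [None, None, None, None, None, 3.0]
```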

Excel provides the ability to add a moving average to a scatterplot using
the Insert Trendline command. Let’s add a five-year moving average to the
change in the temperature for the values from the workbook.

To add a moving average to a chart: 

1 Click the Yearly Change chart sheet tab. 

2 Right-click the data series (any data value) on the chart to select it
and open the shortcut menu.

3 Click Add Trendline in the shortcut menu.

4 Click the Trendline Options list item if it is not already selected,
click the Moving Average option button, and then click the Period
up spin arrow until 5 appears as the period. Your dialog box should
look like Figure 11-10.


Figure 11-10 The Format Trendline dialog box

5 Click the Close button and then click outside the chart to deselect the
data series. The moving-average curve appears as in Figure 11-11.

Taking the five-year moving average smooths out the data a bit; however,
even the smoothed values show so much fluctuation that it is difficult to
spot a clear trend (if one exists). We can edit the trendline to increase the
period of the moving average in an attempt to further smooth the data, but
we should use caution because in smoothing the data some crucial
information could be lost.


EXCEL TIPS 



• Excel’s Analysis ToolPak also includes a command to calculate
a moving average and display the moving-average values in a
chart. To run the command, open the Data Analysis ToolPak dialog
box and select Moving Average from the list of analysis tools.


Simple Exponential Smoothing 

The moving average gives equal weight to all previous values in the moving
average period. Thus with a five-year period, a value recorded five years ago
is given as much weight as the value from the previous year. Some feel that
an approach that gives equal weight to all observations within the period
is not always reasonable. For example, if we believe that increased
industrialization is accelerating the effects of human-made global warming and
we want to predict future temperature values, we may want to give greater
weight to the most recent observations and lesser weight to observations
further in the past.

Many analysts advocate a moving average that gives greater weight to more
recent values and one in which the value of the weights drops off
exponentially. This kind of moving average is not limited to a set period of
values but gives some weight to all observations in the data set. The most
recent observation gets weight $w$, where the value of $w$ ranges from 0 to 1.
The next most recent observation gets weight $w(1 - w)$, the one before that
gets weight $w(1 - w)^2$, and so on. In general, the weight assigned to an
observation k units prior to the current observation is equal to
$w(1 - w)^{k-1}$. The exponentially weighted moving average is therefore

$$\text{Exponentially weighted average} = w y_{n-1} + w(1 - w) y_{n-2} + w(1 - w)^2 y_{n-3} + \cdots$$

Here $w$ is called a smoothing factor or smoothing constant. This
technique is called exponential smoothing or, specifically, one-parameter
exponential smoothing. Table 11-3 gives the weights for prior observations
under different values of $w$.


Table 11-3 Exponential Weights

y_{n-1}    y_{n-2}     y_{n-3}       y_{n-4}       y_{n-5}       y_{n-6}
w          w(1 - w)    w(1 - w)^2    w(1 - w)^3    w(1 - w)^4    w(1 - w)^5
0.01       0.0099      0.0098        0.0097        0.0096        0.0095
0.15       0.1275      0.1084        0.0921        0.0783        0.0666
0.45       0.2475      0.1361        0.0749        0.0412        0.0226
0.75       0.1875      0.0469        0.0117        0.0029        0.0007


As the table indicates, different values of w cause the weights assigned to
previous observations to change. For example, when w equals 0.01,
approximately equal weight is given to a value from the most recent observation
and to values observed six units earlier. However, when w has the value of 0.75,
the weight assigned to previous observations quickly drops, so that values
collected six units prior to the current time receive essentially no weight. In
a sense, you could say that as the value of w approaches zero, the smoothed
average has a longer memory, whereas as w approaches 1, the memory of
prior values becomes shorter and shorter.
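
The weights in Table 11-3 can be reproduced with a few lines of Python. The function name `exponential_weights` is only an illustrative choice; the calculation is the weight formula w(1 - w)^(k-1) for k = 1 through 6.

```python
def exponential_weights(w, count=6):
    """Weights w, w(1-w), w(1-w)^2, ... for the `count` most recent prior values."""
    return [w * (1 - w) ** k for k in range(count)]


for w in (0.01, 0.15, 0.45, 0.75):
    row = "  ".join(f"{weight:.4f}" for weight in exponential_weights(w))
    print(f"w = {w:.2f}:  {row}")
```

Because the weights form a geometric series, the first `count` of them add up to 1 - (1 - w)^count, which is why a large w leaves almost no weight for older observations.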
Forecasting with Exponential Smoothing

Exponential smoothing is often used to forecast the value of the next
observation, given the current and prior values. In this situation, you already
know the value of $y_n$ and are trying to forecast the next value $y_{n+1}$.
Call the forecast $S_n$. The formula for $S_n$ is similar to the one we derived
for the exponentially weighted moving average; it is

$$S_n = w y_n + w(1 - w) y_{n-1} + w(1 - w)^2 y_{n-2} + \cdots + w(1 - w)^{n-1} y_1 + (1 - w)^n S_0$$

$S_n$ is more commonly written in an equivalent recursive formula, where

$$S_n = w y_n + (1 - w) S_{n-1}$$

so that $S_n$ is equal to the sum of the weighted values of the current
observation and the previous forecast. Therefore, to create the forecasted
value, an initial forecasted value $S_0$ is required. One option is to let $S_0$
equal $y_1$, the initial observation. Another choice is to let $S_0$ equal the
average of the first few values in the series. The examples in this chapter
will use the first option, setting $S_0$ equal to the first value in the time
series.

Once you determine the value of $S_0$, you can generate the exponentially
smoothed values as follows:

$$S_1 = w y_1 + (1 - w) S_0$$

$$S_2 = w y_2 + (1 - w) S_1$$

$$\vdots$$

$$S_n = w y_n + (1 - w) S_{n-1}$$

and then $S_n$ becomes the value you predict for the next observation in the
time series.
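
The recursion can be sketched in Python as follows. The function names are illustrative, the series is a small made-up example, and S_0 defaults to the first observation as in this chapter. The second function evaluates the expanded (non-recursive) formula so the two forms can be compared.

```python
def exponential_smooth(values, w, s0=None):
    """Return [S_1, ..., S_n] using S_t = w*y_t + (1 - w)*S_{t-1}."""
    s = values[0] if s0 is None else s0  # S_0 = y_1 by default
    smoothed = []
    for y in values:
        s = w * y + (1 - w) * s
        smoothed.append(s)
    return smoothed


def expanded_forecast(values, w, s0=None):
    """S_n from the expanded formula: weighted sum of all observations plus S_0 term."""
    n = len(values)
    s0 = values[0] if s0 is None else s0
    weighted = sum(w * (1 - w) ** j * values[n - 1 - j] for j in range(n))
    return weighted + (1 - w) ** n * s0


series = [6, 4, 8, 5, 0, 7]  # small illustrative series
smoothed = exponential_smooth(series, w=0.5)
print(smoothed)                            # [6.0, 5.0, 6.5, 5.75, 2.875, 4.9375]
print(expanded_forecast(series, w=0.5))    # 4.9375
```

The last smoothed value, 4.9375, is the forecast for the next (seventh) observation, and both forms of the formula give the same number.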


Assessing the Accuracy of the Forecast 

Once you generate the smoothed values, how do you measure their accuracy
in forecasting values of the time series? One way is to use exponential
smoothing to calculate $\hat{y}_t$, the predicted value of the time series at
time t. Then, for each value in the time series, compare $\hat{y}_t$ to the
observed value, $y_t$. The mean square error (MSE) gives the average of the
squared differences between the forecasted values and the observed values.
The formula for the MSE is

$$\text{MSE} = \frac{\sum_{t=1}^{n} (y_t - \hat{y}_t)^2}{n}$$

By comparing the MSE of one set of smoothed values to another, one can
determine which set does a better job of forecasting the data.

The square root of the MSE gives us the standard error, which indicates
the magnitude of the typical forecast error. A standard error of 5 would
indicate that the forecasts are typically off by about 5 points.

Another way of measuring the magnitude is to take the average of the
absolute values of the differences between the forecasted and observed values.
This measure, called the mean absolute deviation (MAD), has the formula

$$\text{MAD} = \frac{\sum_{t=1}^{n} \left| y_t - \hat{y}_t \right|}{n}$$

One of the differences between the MAD and the MSE is that the MAD
does not penalize a forecast as much for very large errors. Because the MSE
squares the deviations, large errors become even more prominent.

Another measure is the mean absolute percent error (MAPE), which
expresses the accuracy as a percentage of the observed value. The formula for
the MAPE is

$$\text{MAPE} = \frac{\sum_{t=1}^{n} \left| (y_t - \hat{y}_t) / y_t \right|}{n} \times 100$$
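
The three accuracy measures can be computed together, as in the Python sketch below. The observed and forecasted values are hypothetical, and the function name `forecast_errors` is an illustrative choice. Note that the MAPE is undefined when an observed value is zero.

```python
import math


def forecast_errors(observed, forecast):
    """MSE, standard error, MAD, and MAPE for paired observed/forecast values."""
    n = len(observed)
    errors = [y - f for y, f in zip(observed, forecast)]
    mse = sum(e ** 2 for e in errors) / n
    mad = sum(abs(e) for e in errors) / n
    # Percent error is relative to the observed value (which must be nonzero)
    mape = 100 * sum(abs(e / y) for e, y in zip(errors, observed)) / n
    return {
        "MSE": mse,
        "standard_error": math.sqrt(mse),
        "MAD": mad,
        "MAPE": mape,
    }


# Hypothetical values: the errors are -2, 2, and -4
measures = forecast_errors(observed=[10, 20, 40], forecast=[12, 18, 44])
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```

Here the MSE is (4 + 4 + 16)/3 = 8, the MAD is 8/3, and the MAPE is the average of 20%, 10%, and 10%.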


To help you get a visual image of the impact that differing values of w
have on smoothing the data and forecasting the next value in the series, you
can open the Exponential Smoothing workbook.



CONCEPT TUTORIALS 

One-Parameter Exponential Smoothing 


To use the Exponential Smoothing workbook: 

1 Open the Exponential Smoothing workbook from the Explore folder. 
Enable the macros in the workbook. 

2 Review the contents of the workbook up to the section entitled
Explore One-Parameter Exponential Smoothing. The worksheet is
shown in Figure 11-12.

This worksheet shows the observed percentage changes in a sample time
series overlaid with the one-parameter exponentially smoothed values. The
smoothing factor w is set at 0.15. In the lower-right corner, the worksheet
contains an area curve indicating the magnitude of the weights assigned to
the observations prior to the last value in the series.

The final forecasted value is 0.028. The most recent observation has the
most weight in calculating this result, with observations decreasing
exponentially in importance. Comparing the curve to the time series tells
you that the large drop in the middle of the time series has little weight
in estimating the final value. In fact, observations prior to that value have
negligible impact.

The mean square error is 0.088 and the standard error is 0.297, showing
that if you had used exponential smoothing on this data your typical error
in forecasting would have been about 0.297 points.

One way of choosing a value for the smoothing constant is to pick the
value that results in the lowest mean square error. Let’s see what happens
to the mean square error when you decrease the value of the smoothing
constant.

To decrease the value of the smoothing constant:

Click the down spin button repeatedly to reduce the value of w to
0.03. The forecasted values and the weight assigned to prior
observations change dramatically. See Figure 11-13.



With such a small value for w, the smoothed value has a long memory. In
fact, the final forecasted value, 0.023, is based in some part on observations
spanning the entire time series. A consequence of having such a small value
for w is that individual events, such as the large drop-off in the middle of
the time series, have a minor impact on the smoothed values. The line of
forecasted values is practically straight. Note as well that the standard error
has declined from 0.297 to 0.281. In this case, the time series data is best
estimated by the overall average or by a smoothed value that has a long memory.

Now increase the valne of the smoothing factor to make the forecasts 
more snsceptible to nnit-by-nnit changes. 
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Figure 11-14 
Increasing the 
value of w 
to 0.60 


forecasted values 
are more variable 


only recent values are 
heavily weighted 


To increase the smoothing factor: 

Click the up spin button repeatedly to increase the value of w to
0.60. See Figure 11-14.

With a larger value for w, the forecasted values are much more
variable, almost as variable as the observations themselves. This is
a result of so much weight assigned to the value immediately prior to
the current value. If one value shows a large upward swing, then the
forecasted value for the next value tends to be high. As w approaches 1,
the forecasted values appear more and more like lag 1 values.

Continue trying different values for w to see how they affect the
smoothed curve and the standard error of the forecasts. Can you
find a value for w that results in forecasts with the smallest standard
error?
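
The search for a good value of w can also be sketched in Python. This is a minimal illustration, not the workbook's implementation: it assumes the smoothed value starts at the first observation and takes the standard error to be the square root of the mean squared one-step forecast error, which may differ in detail from the StatPlus calculation. The sample series is made up.

```python
import math


def one_parameter_forecasts(y, w):
    """One-step-ahead forecasts from one-parameter exponential smoothing.

    Assumption: the smoothed value is initialized to the first observation.
    """
    smoothed = y[0]
    forecasts = []
    for obs in y[1:]:
        forecasts.append(smoothed)               # forecast made before seeing obs
        smoothed = w * obs + (1 - w) * smoothed  # update after seeing obs
    return forecasts


def standard_error(y, w):
    """Root mean squared one-step forecast error (an assumed definition)."""
    forecasts = one_parameter_forecasts(y, w)
    errors = [obs - fc for obs, fc in zip(y[1:], forecasts)]
    return math.sqrt(sum(e * e for e in errors) / len(errors))


# Made-up series for illustration only.
series = [0.10, 0.32, -0.05, 0.21, 0.40, -0.60, -0.35, 0.05, 0.18, 0.02, 0.25, 0.11]

grid = [round(0.05 * k, 2) for k in range(1, 20)]   # 0.05, 0.10, ..., 0.95
best_w = min(grid, key=lambda w: standard_error(series, w))
print(best_w, round(standard_error(series, best_w), 4))
```

Setting w to 1 in this sketch reproduces the lag 1 values exactly, which matches the behavior described above as w approaches 1.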

Close the Exponential Smoothing workbook without saving your
changes. You’ll return to the workbook later in this chapter.

Choosing a Value for w


As you saw in the Exponential Smoothing workbook, you have to choose the
value of w with care. When choosing a value, keep several factors in mind.
Generally, you want the standard error of the forecasts to be low, but this is
not the only consideration. The value of w that gives the lowest standard
error might be very high (such as 0.9) so that the exponential smoothing does
not result in very smooth forecasts. If your goal is to simplify the appearance
of the data or to spot general trends, you would not want to use such a high
value for w, even if it produced forecasts with a low standard error. Analysts
generally favor values for w ranging from 0.01 to 0.3. Choosing appropriate
parameter values for exponential smoothing is often based on intuition and
experience. Nevertheless, exponential smoothing has proved valuable in
forecasting time series data.

The ability to perform exponential smoothing on time series data has
been provided for you with StatPlus. Let’s smooth the temperature data,
using a w value of 0.18.

To create exponentially smoothed forecasts of the mean annual 
temperature: 

1 Return to the Global Temperature Analysis workbook and go to the 
Temperature worksheet. 

2 Click Time Series from the StatPlus menu and then click Exponential
Smoothing.

3 Click the Data Values button and select Fahrenheit from the list of 
range names. Click OK. 

4 Type 0.18 in the Weight box under General Options. 

5 Click the Output button and direct the output to a new worksheet 
named Smoothed Temperature. Click OK. 

The completed dialog box appears in Figure 11-15.

Figure 11-15 The Perform Exponential Smoothing dialog box

6 Click OK.

Excel displays the worksheet shown in Figure 11-16.

Figure 11-16 Smoothed temperatures

The output shown in Figure 11-16 consists of three columns: the observation
numbers, the recorded mean annual temperatures, and the temperatures
forecasted for each year based on the smoothing model. The values are then
plotted on the chart. It appears that the forecasted values generally
underestimated the mean annual temperatures in the last decades of the twentieth
century. This may indicate that temperatures are warming faster than
expected. The lower forecasted values might also reflect the effect of the
slight dip in temperature values that occurred during the middle decades of
the century. The standard error of the forecasts, 0.250864, indicates that the
typical forecasting error was about 0.25 degrees Fahrenheit per year.

The one-parameter exponential smoothing only uses weighted averages
of previous observations to calculate future results. It does not assume a
particular trend for the time series, but it is apparent from the data that the
temperature values have been increasing over the time interval being studied.
We can insert a trend assumption into our model by using two-parameter
exponential smoothing.

EXCEL TIPS

• Excel’s Analysis ToolPak also includes a command to perform
one-parameter exponential smoothing. To run the command,
select Exponential Smoothing from the list of analysis tools in
the Data Analysis ToolPak.


Two-Parameter Exponential Smoothing 

To explore how to add a trend assumption to exponential smoothing, let’s
first express one-parameter exponential smoothing in terms of the following
equation for $y_t$, the value of the y variable at time t:

$$y_t = \beta_0 + \epsilon_t$$

where $\beta_0$ is the location parameter that changes slowly over time, and
$\epsilon_t$ is the random error at time t. If $\beta_0$ were constant throughout time, you
could estimate its value by taking the average of all the observations. Using
that estimate, you would forecast values that would always be equal to
your estimate of $\beta_0$. However, if $\beta_0$ varies with time, you weight the more
recent observations more heavily than distant observations in any forecasts
you make. Such a weighting scheme could involve exponential smoothing.
How could such a situation occur in real life? Consider tracking crop yields
over time. The average yield could slowly change over time as equipment
or soil science technology improved. An additional factor in changing the
average yield would be the weather, because a region of the country might
go through several years of drought or good weather.

Now suppose the values in the time series follow a linear trend so that
the series is better represented by this equation:

$$y_t = \beta_0 + \beta_1 t + \epsilon_t$$

where $\beta_1$ is the trend parameter, whose value can also change over time.
If $\beta_0$ and $\beta_1$ were constant throughout time, you could estimate their values
using simple linear regression. However, when the values of these
parameters change, you can try to estimate their values using the same
smoothing techniques you used with one-parameter exponential smoothing (this
approach is known as Holt’s method). This type of smoothing estimates a
line fitting the time series, with more weight given to recent data and less
weight given to distant data. A county planner might use this method to
forecast the growth of a suburb. The planner would not expect the rate of
growth to be constant over time. When the suburb was new, it could have
had a very high growth rate, which might change as the area becomes
saturated with people, as property taxes change, or as new community services
are added. In forecasting the probable growth of the community, the planner
tends to weight recent growth rates much more heavily than older ones.


Calculating the Smoothed Values 

The formulas for two-parameter smoothing are very similar in form to the
simple one-parameter equations. Define $S_n$ to be the value of the location
parameter for the nth observation and $T_n$ to be the trend parameter. Because
we have two parameters, we also need two smoothing constants. We’ll use
the familiar w constant for smoothing the estimates of $S_n$, and we’ll call t
the smoothing constant for $T_n$. Using the same recursive form as was
discussed with one-parameter exponential smoothing, we calculate $S_n$ and
$T_n$ as follows:

$$S_n = w y_n + (1 - w)(S_{n-1} + T_{n-1})$$

$$T_n = t(S_n - S_{n-1}) + (1 - t) T_{n-1}$$

and the formula for the forecasted value of $y_{n+1}$ is

$$y_{n+1} = S_n + T_n$$

The values of the smoothing constants need not be equal. Although the equations
may seem complicated, the idea is fairly straightforward. The value of $S_n$ is
a weighted average of the current observation and the previous forecasted
value. The value of $T_n$ is a weighted average of the change in $S_n$ and the
previous estimate of the trend parameter. As with simple exponential smoothing,
you must determine the initial values $S_0$ and $T_0$. One method is to fit a
linear regression line to the entire series and use the intercept and slope of the
regression equation as initial estimates for the location and trend parameters.
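
The two recursive equations translate almost line for line into code. The following Python sketch is an illustration under stated assumptions rather than the StatPlus implementation: the function names are hypothetical, the time index runs 1, 2, ..., n, and the initial values come from a least-squares line as suggested above. The sample series is made up.

```python
def linear_fit(y):
    """Least-squares intercept and slope of y against time 1, 2, ..., n."""
    n = len(y)
    x_mean = (n + 1) / 2
    y_mean = sum(y) / n
    sxy = sum((x - x_mean) * (v - y_mean) for x, v in enumerate(y, start=1))
    sxx = sum((x - x_mean) ** 2 for x in range(1, n + 1))
    slope = sxy / sxx
    return y_mean - slope * x_mean, slope


def holt_smoothing(y, w, t, s0, t0):
    """Two-parameter smoothing; returns one-step forecasts and the final S_n, T_n."""
    s_prev, t_prev = s0, t0
    forecasts = []
    for obs in y:
        forecast = s_prev + t_prev                    # forecast of the current value
        forecasts.append(forecast)
        s = w * obs + (1 - w) * forecast              # S_n
        trend = t * (s - s_prev) + (1 - t) * t_prev   # T_n
        s_prev, t_prev = s, trend
    return forecasts, s_prev, t_prev


# Made-up series for illustration only.
series = [50.2, 50.9, 51.1, 52.0, 52.4, 53.5, 53.7, 54.6]
s0, t0 = linear_fit(series)
forecasts, s_n, t_n = holt_smoothing(series, w=0.15, t=0.15, s0=s0, t0=t0)
print(round(s_n + t_n, 3))   # forecast for the next, unobserved value
```

On a series that is exactly linear, with the true intercept and slope as starting values, the forecasts in this sketch reproduce the observations.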
CONCEPT TUTORIALS

Two-Parameter Exponential Smoothing 


The Exponential Smoothing workbook that you used earlier also contains
an interactive tutorial on two-parameter exponential smoothing.


To view the Exponential Smoothing workbook:

1 Return to the Exponential Smoothing file in the Explore folder.
Enable the macros in the workbook.

2 Scroll through the workbook until you reach “Explore Two-Parameter
Exponential Smoothing.” See Figure 11-17.


Figure 11-17 Exploring two-parameter exponential smoothing (worksheet values: location constant w = 0.15, trend constant t = 0.15, 128 observations, forecasted value 58.295, forecasted trend 0.042, mean squared error 0.066; callouts: forecasted values; relative weights used in forecasting the trend parameter)


The worksheet shows data from a sample time series. The smoothing
factor for the location is equal to 0.15, as is the smoothing factor for trend.
The area curves at the bottom of the chart indicate the relative weights
assigned to previous values in calculating the final forecast for the location
and trend parameters. For t and w equal to 0.15, the most prominent
observations occur within a few units of the current value. Earlier observations
have too little effect to be visible on the chart. On the basis of the
two-parameter exponential smoothing estimates, the forecasted value of the time
series is projected to be about 58.295, increasing at a rate of 0.042 points per
unit of time.

The values chosen for w and t are important in determining what the
forecasted value for the time series will be. If we assume that the data will
continue to behave as it did for earlier values, smaller values for w and t
might be used, because those would result in an estimate that has a longer
“memory” of previous values. Let’s see what kind of difference this would
make by reducing the value of t from 0.15 to 0.05.


To reduce the value of t: 

1 Repeatedly click the down spin arrow next to the Trend constant
until the value of t equals 0.05. See Figure 11-18.

Figure 11-18 Decreasing the value of t to 0.05 (callouts: a positive trend is forecast; observations further back in time are heavily weighted in forecasting the trend)

With this value of t, the forecasted increase in the time series data changes
to 0.030 points per unit, reflecting the assumption that there will be an
increase in the data similar to what was observed earlier in the time series.
Note that the weights for the trend factor, as shown by the area curve,
indicate that older observations are well represented in the forecast. Now let’s
see what would happen if we increased the value of t, focusing more on
short-term trends.


To increase the value of t:

1 Repeatedly click the up spin arrow next to the Trend constant until
the value of t equals 0.40. See Figure 11-19.

Figure 11-19 Increasing the value of t to 0.40 (callouts: a negative trend is forecast; only the most recent observations are used in forecasting the trend)

With a higher smoothing constant, the forecasted trend of the time series
shows an increase of 0.044 points per unit of time. The area curve indicates
that the smoothed trend estimate has a shorter memory; only the most
recent observations are relevant in estimating the trend.

Using this worksheet, you can change the values of the smoothing
constant for the location and trend parameters. What combinations result in the
lowest values for the standard error? When you are finished with your
investigations, close the workbook. You do not have to save any of your changes.

Now let’s return to the global temperature data. In using one-parameter 
exponential smoothing the forecasted values underestimated the most recent 
trend in temperature data. To compensate we’ll use two-parameter exponential 
smoothing in an attempt to “pick up” the most recent trend of increasing 
temperatures. 

To forecast global temperatures using two-parameter exponential 
smoothing: 

1 Return to the Global Temperature Analysis workbook and go to the 
Temperature worksheet. 

2 Click Time Series from the StatPlus menu and then click Exponential
Smoothing.

3 Click the Data Values button and select Fahrenheit from the list of 
range names. Click OK. 


4 Click the Linear Trend option button to add a linear trend to the 
forecasted temperature values. 

5 Click the Output button and specify the worksheet Smoothed 
Temperatures 2 as the output worksheet. Click the OK button twice. 


Figure 11-20 shows the forecasted temperature values using two-parameter 
exponential smoothing. 


Figure 11-20 Forecasting global temperatures with two-parameter exponential smoothing



By adding the trend parameter our smoothed values have picked up the 
recent trend in rising global temperatures indicated in the data. At this point 
you can close the Global Temperature Analysis workbook. 


Seasonality 

Often time series are measured on a seasonal basis, such as monthly or 
quarterly. If the data are sales of ice cream, toys, or electric power, there is 
a pattern that repeats each year. Ice cream and electric power sales are high 
in the summer, and toy sales are high in December. 


Multiplicative Seasonality 

If the sales of some of your products are seasonal, you might want to adjust
your sales for the seasonal effect, in order to compare figures from month
to month. To compare November and December sales, should you use the
difference of the values or the ratio? In many cases seasonal changes are best
expressed in ratios, especially if there is substantive growth in yearly sales.

As annual sales increase, the difference between the November and
December values should also increase, but the ratio of sales between the two
months might remain nearly constant. This is called multiplicative seasonality.
To quantify the effect of the season on each month’s value, we need to
assign a multiplicative factor to each month. If the month’s sales are equal to
the expected yearly average, we’ll give it a multiplicative factor of 1.
Consequently, months with higher-than-average sales have multiplicative factors
greater than 1, and months with lower-than-average sales have
multiplicative factors less than 1.

As an example, consider Table 11-4, which shows seasonal sales and
multiplicative factors.


Table 11-4 Multiplicative Seasonality

                     Jan     Feb     Mar     Apr     May     Jun     Jul     Aug     Sep     Oct     Nov     Dec
Sales                220     310     359     443     374     660    1030    1320    1594    1093     950     610
Factor              0.48    0.58    0.60    0.69    0.59    1.00    1.48    1.69    1.99    1.29    1.02    0.59
Adjusted sales     458.3   534.5   598.3   642.0   633.9   660.0   695.9   781.1   801.0   847.3   931.4  1033.9


The monthly sales figures are shown in the first row of the table. The
multiplicative factors based on previous years’ sales are shown in the second row.
Dividing the sales in each month by the multiplicative factor yields the adjusted
sales. Plotting the sales values and adjusted sales values in Figure 11-21 reveals
that sales have been steadily increasing throughout the year. This information
is masked in the raw sales data by the seasonal effects.
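
The adjustment in Table 11-4 is a single division per month. As a small Python illustration using the table's own sales and factor rows (the variable names are arbitrary):

```python
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
sales = [220, 310, 359, 443, 374, 660, 1030, 1320, 1594, 1093, 950, 610]
factors = [0.48, 0.58, 0.60, 0.69, 0.59, 1.00, 1.48, 1.69, 1.99, 1.29, 1.02, 0.59]

# Multiplicative adjustment: divide each month's sales by its factor.
adjusted = [round(s / f, 1) for s, f in zip(sales, factors)]

for month, value in zip(months, adjusted):
    print(month, value)

# The twelve factors add up to the length of the period.
print(round(sum(factors), 2))
```

The computed values agree with the adjusted sales row of the table, and the factors sum to 12.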


Figure 11-21 Plot of adjusted sales data

Additive Seasonality


Sometimes the seasonal variation is expressed in additive terms, especially
if there is not much growth. If the highest annual sales total is no more than
twice the lowest annual sales total, it probably does not matter whether you
use differences or ratios. If you can express the month-to-month changes in
additive terms, the seasonal variation is called additive seasonality.
Additive seasonality is expressed in terms of differences from the expected
average for the year. In Table 11-5, the seasonal adjustment for December sales is
-240, resulting in an adjusted sales for that month of 681. After adjustment
for the time of the year, December turned out to be one of the most successful
months, at least in terms of exceeding goals.


Table 11-5 Additive Seasonality

                     Jan     Feb     Mar     Apr     May     Jun     Jul     Aug     Sep     Oct     Nov     Dec
Sales                298     378     373     443     374     660    1004    1153    1388     904     715     441
Factor              -325    -270    -270    -200    -280     -55     350     450     550     220      70    -240
Adjusted sales       623     648     643     643     654     715     654     703     838     684     645     681
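
For additive seasonality the adjustment is a subtraction rather than a division. The sketch below, again in Python with arbitrary variable names, recomputes the adjusted row of Table 11-5 and ranks the months by adjusted sales to see where December falls.

```python
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
sales = [298, 378, 373, 443, 374, 660, 1004, 1153, 1388, 904, 715, 441]
factors = [-325, -270, -270, -200, -280, -55, 350, 450, 550, 220, 70, -240]

# Additive adjustment: subtract each month's seasonal factor from its sales.
adjusted = [s - f for s, f in zip(sales, factors)]

# Rank the months from highest to lowest adjusted sales.
ranking = sorted(zip(months, adjusted), key=lambda pair: pair[1], reverse=True)
december_rank = [month for month, _ in ranking].index("Dec") + 1

print(adjusted)
print(december_rank, sum(factors))
```

With the values in the table, December's adjusted figure of 681 is the fifth highest of the twelve months, and the additive factors sum to zero.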


In this chapter you’ll work with multiplicative seasonality only, but you 
should be aware of the principles of additive seasonality. 


Seasonal Example: Liquor Sales 

Are liquor sales seasonal? The Liquor workbook has monthly liquor store 
sales in terms of millions of dollars from January 1996 through December 
2007. The workbook contains the variables and reference names shown in 
Table 11-6. 


Table 11-6 The Liquor Workbook

Range Name    Range      Description
Year          A2:A145    The year
Month         B2:B145    The month
Year_Month    C2:C145    The year and the month
Sales         D2:D145    The monthly liquor sales in millions of dollars

To open the Liquor workbook:

1 Open the Liquor workbook from the Chapter11 data folder.

2 Save the workbook as Liquor Sales Analysis. The workbook appears
as shown in Figure 11-22.

Figure 11-22 The Liquor Sales Analysis workbook

As a first step in analyzing these data, create a line plot of sales versus year
and month.

To create a time series plot: 

1 Select the range C1:D145 and click the Line button from the Charts
group on the Insert tab. Select the first chart subtype (Line).

2 Move the chart to the chart sheet Sales Chart.

3 Enter the chart title Liquor Sales 1996-2007, enter Year for the
x-axis title, and enter Sales ($mil) for the y-axis title. Remove the
gridlines and legend from the plot.

4 Click the Axes button from the Axes group on the Layout tab of the
Chart Tools ribbon. Click Primary Horizontal Axis and then click
More Primary Horizontal Axis Options.

5 Within the Format Axis dialog box, click the Specify interval unit
option button and enter 24 for the number of units between axis
labels. See Figure 11-23.

Figure 11-23 Format Axis dialog box

6 Click the Close button. Figure 11-24 shows the formatted chart of
liquor sales.

The plot shows that sales are seasonal, with peaks occurring each winter
(around the holidays). There also appears to be another peak in the summer,
perhaps around the Fourth of July. In addition to the seasonality of the data,
there appears to be a linear trend of increasing sales from 1996 to 2007.


Examining Seasonality with a Boxplot 


One way to see the seasonal variation is to make a boxplot, with a box for
each of the 12 months. This gives you a picture of the month-to-month
variation in liquor sales. The shape of each box tells you how sales for
that month varied from 1996 to 2007.

To create the boxplot:

1 Click the Liquor Sales sheet tab.

2 Click Single Variable Charts from the StatPlus menu and click
Boxplots.

3 Click the Connect Medians between Boxes checkbox.

4 Click the Data Values button and select Sales from the range names
list. Click OK.

5 Click the Categories button and select Month from the list of range
names. Click OK.

6 Click the Output button and direct the plot to a new chart sheet
named Sales Boxplot. Click OK twice.

7 Rescale the y axis to go from 1000 to 5000.

8 Insert the chart title Liquor Sales. Add the x-axis title Month and
the y-axis title Sales ($mil). Edit the labels at the bottom of the
boxplot, removing the “Month =” text from each.

Figure 11-25 shows the edited boxplot for the liquor sales data. 


Figure 11-25 Liquor sales boxplot

The boxplot in Figure 11-25 shows how monthly liquor sales vary
throughout the year. Sales peak in December and show a slight dip in the
months of September and October. The boxplot also indicates the range
of sales levels for each month. There are a couple of outliers in the
months of March and June, but there is nothing extreme.

Examining Seasonality with a Line Plot 

You can also use a two-way table to create a line plot of sales
versus month for each year of the data set. This is another way to get insight
into the monthly sales figures during this time period. You will first have to
create a two-way table of the data in the Liquor Sales worksheet.

To create a line plot of liquor sales:

1 Return to the Liquor Sales worksheet.

2 Click Manipulate Columns from the StatPlus menu and then click
Create Two-way Table.

3 Select Sales for the Data Values variable, Month for the Column 
Levels, and Year for the Row Levels. 

4 Deselect the Sort the Column Levels checkbox. 


Note: You want to deselect this checkbox to prevent the two-way 
table from sorting the columns in alphabetical order, rather than 
leaving them in time order. 

5 Send the two-way table to a new worksheet named Sales by Year. 

6 Click OK. 

7 Go to the Sales by Year worksheet. 

8 Select the range B2:M14 and click the Line button from the Charts 
group on the Insert tab. Click the first chart subtype (Line). Move the 
chart to a new chart sheet named Liquor Sales Line Plot. 

9 Enter Liquor Sales versus Month for the chart title, Months for the
x-axis title, and Sales ($mil) for the y-axis title. Remove the legend and
gridlines from the plot.

Figure 11-26 shows the formatted line chart.

Figure 11-26 Line plot of monthly liquor sales

The line plot in Figure 11-26 demonstrates the seasonal nature of the data
and also allows you to observe individual values. Plots like this are
sometimes called spaghetti plots, for obvious reasons.
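
The two-way table behind this plot is a reshaping of the long list of monthly values into one row per year and one column per month. A minimal Python sketch of that reshaping follows. The records are made up and are not the Liquor workbook values, and the function name is hypothetical. The columns are kept in first-seen order rather than sorted, mirroring the reason for deselecting the Sort the Column Levels checkbox.

```python
from collections import defaultdict

# Made-up (year, month, sales) records in time order.
records = [
    (1996, "Jan", 10), (1996, "Feb", 12), (1996, "Mar", 11),
    (1997, "Jan", 13), (1997, "Feb", 15), (1997, "Mar", 14),
]


def two_way_table(rows):
    """Return the column labels and one row per year: [year, value, value, ...]."""
    months = []                      # column labels in first-seen (time) order
    table = defaultdict(dict)
    for year, month, value in rows:
        if month not in months:
            months.append(month)
        table[year][month] = value
    body = [[year] + [table[year].get(m) for m in months] for year in sorted(table)]
    return months, body


months, body = two_way_table(records)
print(["Year"] + months)
for row in body:
    print(row)
```

Each row of the result corresponds to one line in a plot such as Figure 11-26.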


Applying the ACF to Seasonal Data 


You can also use the autocorrelation function to display the seasonality of 
the data. For a seasonal monthly series, the ACF should be very high at lag 
12, because the current value should be strongly correlated with the value 
from the same month in the previous year. 
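
To see why a seasonal series produces a large autocorrelation at lag 12, the following Python sketch computes the usual sample autocorrelation for a made-up monthly pattern repeated over six years. The formula is the standard sample ACF and is assumed, not confirmed, to match the StatPlus calculation. The series is artificial and is not the liquor data.

```python
def autocorrelation(y, lag):
    """Sample autocorrelation of the series y at the given lag."""
    n = len(y)
    mean = sum(y) / n
    denom = sum((v - mean) ** 2 for v in y)
    num = sum((y[i] - mean) * (y[i - lag] - mean) for i in range(lag, n))
    return num / denom


# Made-up monthly pattern repeated for six years.
pattern = [3, 2, 2, 3, 4, 5, 6, 5, 4, 4, 5, 9]
series = pattern * 6

acf = {lag: autocorrelation(series, lag) for lag in range(1, 25)}
for lag in (1, 6, 12, 24):
    print(lag, round(acf[lag], 3))
```

Because every value in this artificial series equals the value twelve months earlier, the lag 12 autocorrelation is 5/6: only 60 of the 72 observations have a partner one year back. Real data with a trend and noise give smaller values, but the peak at lag 12 remains the signature of yearly seasonality.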


To calculate the ACF: 

1 Return to the Liquor Sales worksheet. 

2 Click Time Series from the StatPlus menu and then click ACF Plot. 

3 Select Sales for the Data Values variable. 

4 Click the up spin arrow to calculate the ACF up through a lag of 24. 

5 Send the output to a new worksheet named Liquor Sales ACF. 

6 Click OK. 

Figure 11-27 shows the ACF of the liquor sales data. 


Figure 11-27 ACF of the liquor sales data

The autocorrelation holds between adjacent months, but the highest
autocorrelation exists for sales that are 12 months or one year apart. Also
note a second significant autocorrelation occurs at 24 months. So the ACF
results do show a seasonal correlation of the sales figures. In other words,
the pattern of sales from one year to the next is fairly consistent.


Adjusting for Seasonality 

Because the liquor sales data have a seasonal component, it would be useful
to adjust the values for the seasonal effect. In this way you can determine
whether a drop in sales during one month is due to seasonal effects
or is a true decline. Adjusting the sales data for seasonality also gives
you a better indication of the trend in liquor sales over the course of the
12 years of the study. You can use StatPlus to adjust time series data for
multiplicative seasonality.

To adjust the liquor sales data:

1 Return to the Liquor Sales worksheet.

2 Click Time Series from the StatPlus menu and then click Seasonal
Adjustment. Click the Data Values button and select Sales from the
range names list. Click OK.

3 Verify that a period of length 12 is entered into the Length of Period
box.

4 Click the Output button and send the output to a new worksheet
named Adjusted Sales. Click OK.

Your completed dialog box should look like Figure 11-28.

Figure 11-28 The Perform Seasonal Adjustment dialog box

5 Click OK.

Excel generates the adjusted sales values shown in Figure 11-29. 



Figure 11-29 Liquor sales adjusted for seasonal effects

The observed sales levels are shown in column B, and the seasonally
adjusted values are shown in column C. Using the adjusted values can
give you some insight into the changing sales values adjusted for seasonal
effects. For example, between observations 4 and 5 the sales value increases
by 160 units (representing an increase of $160 million); however, when
adjusted for the seasonal effects, the increase in sales is about $3 million. In
other words, when adjusting for the effects of seasonal variation, the sales
increased that month by $3 million over what would be expected in a usual year.

You can get some idea of the relative sales for different months of the year
from the table of seasonal indexes. For example, the seasonal index for June is
1.002 and for July it is 1.040. This indicates that you can expect a percentage
increase in liquor sales of (1.040 - 1.002)/1.002 = 0.037546, or about
3.75%, going from June to July each year. Seasonal indexes for the
multiplicative model must add up to the length of the period, in this case 12. You can
use this information to tell you that 11.64% of the liquor sales take place in
December (because 1.397/12 = 0.1164). A line plot of the seasonal indexes
is provided and shows a profile very similar to the one you saw earlier with
the boxplot.
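
The two calculations with seasonal indexes can be checked in a few lines of Python. The sketch uses the rounded index values quoted in the text, so the June-to-July change comes out at roughly 3.8% rather than exactly 0.037546; the small gap is presumably due to rounding of the displayed indexes, which is an assumption on our part.

```python
june, july, december = 1.002, 1.040, 1.397   # rounded seasonal indexes from the text
period = 12                                  # the indexes sum to the period length

# Relative change in expected sales going from June to July.
pct_change = (july - june) / june

# Share of a year's sales expected to fall in December.
december_share = december / period

print(round(pct_change, 4), round(december_share, 4))
```

The December share of about 11.64% is well above the 1/12, or about 8.3%, that an average month would contribute.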

A chart is also included, showing both the sales data and the
adjusted sales values. There is a clear increase in liquor sales in the
data set after adjusting for seasonal variation. To further explore this trend,
you can smooth the sales data using three-parameter exponential smoothing.


Three-Parameter Exponential Smoothing 

You perform exponential smoothing on seasonal data using three smoothing
constants. This process is known as three-parameter exponential smoothing
or Winters’ method. The smoothing constants in the Winters’ method
involve location, trend, and seasonality. Winters’ method can be used
for either multiplicative or additive seasonality, though in this text, we’ll
assume only multiplicative seasonality. The equation for a time series variable
$y_t$ with a multiplicative seasonality adjustment is

$$y_t = (\beta_0 + \beta_1 t) \times I_p + \epsilon_t$$

and for additive seasonality adjustment the equation is

$$y_t = (\beta_0 + \beta_1 t) + I_p + \epsilon_t$$

In these equations $\beta_0$, $\beta_1$, and $\epsilon_t$ once again represent the location, trend,
and error parameters of the model and $I_p$ represents the seasonal index at point
p in the seasonal data. For example, if we used the multiplicative seasonal
indexes shown in Figure 11-29, 7g would equal 1.015. Once again, these
parameters are not considered to be constant but can vary with time. The liquor
sales data are an example of such a series. The sales are seasonal, but there
is also a time trend to the data such that sales increase from year to year after
adjusting for seasonality.

Let’s concentrate on smoothing with a multiplicative seasonality factor.
The smoothing equations used in three-parameter exponential smoothing
are similar to equations you’ve already seen. For the smoothed location
value $S_n$ and the smoothed trend value $T_n$ from a time series where the
length of the period is q, the recursive equations are

$$S_n = w \frac{y_n}{I_{n-q}} + (1 - w)(S_{n-1} + T_{n-1})$$

$$T_n = t(S_n - S_{n-1}) + (1 - t) T_{n-1}$$

Note that the recursive equation for $S_n$ is identical to the equation used in
two-parameter exponential smoothing except that the current observation $y_n$
must be seasonally adjusted. Here, $I_{n-q}$ is the seasonal index taken from the
index values of the previous period. The recursive equation for $T_n$ is
identical to the recursive equation for the two-parameter model.

As you would expect, three-parameter smoothing also smooths out the
values of the seasonal indexes, because these might also change over time.
We use a different smoothing constant, c, for these indexes. The recursive
equation for a seasonal index is

$$I_n = c \frac{y_n}{S_n} + (1 - c) I_{n-q}$$

The value of the smoothed seasonal index is a weighted average of the
current seasonal index based on the values of $y_n$ and $S_n$, and the index value
from the previous period. Calculating initial estimates of $S_0$, $T_0$, and each
initial seasonal index is beyond the scope of this book.
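
The three recursive equations can be sketched in Python as follows. This is an illustration under stated assumptions, not the StatPlus implementation. The caller supplies the initial location, trend, and seasonal indexes, since their estimation is outside the scope of the text. The one-step forecast is taken to be $(S_{n-1} + T_{n-1}) \times I_{n-q}$ and the k-step forecast to be $(S_n + k T_n)$ times the latest index for that season; these are the usual Winters forecasts but are not stated in the text. All function and variable names are hypothetical.

```python
def winters_multiplicative(y, w, t, c, q, s0, t0, seasonal0):
    """Three-parameter smoothing with multiplicative seasonality.

    seasonal0 holds q initial seasonal indexes, one per position in the period.
    Returns the one-step forecasts and the final location, trend, and indexes.
    """
    s_prev, t_prev = s0, t0
    index = list(seasonal0)            # index[p] is the latest index for position p
    forecasts = []
    for n, obs in enumerate(y):
        p = n % q
        old_index = index[p]                                  # I_{n-q}
        forecasts.append((s_prev + t_prev) * old_index)       # assumed forecast form
        s = w * obs / old_index + (1 - w) * (s_prev + t_prev)  # S_n
        trend = t * (s - s_prev) + (1 - t) * t_prev            # T_n
        index[p] = c * obs / s + (1 - c) * old_index           # I_n
        s_prev, t_prev = s, trend
    return forecasts, s_prev, t_prev, index


def forecast_ahead(s, trend, index, n_obs, q, steps):
    """Forecasts 1 .. steps units beyond the last of n_obs observations."""
    return [(s + k * trend) * index[(n_obs + k - 1) % q]
            for k in range(1, steps + 1)]


# Made-up quarterly example (q = 4) for illustration only.
start_index = [0.8, 1.0, 1.3, 0.9]
series = [(100 + 2 * (n + 1)) * start_index[n % 4] for n in range(12)]
fc, s_n, t_n, final_index = winters_multiplicative(
    series, w=0.15, t=0.15, c=0.05, q=4, s0=100.0, t0=2.0, seasonal0=start_index)
print([round(v, 1) for v in forecast_ahead(s_n, t_n, final_index, len(series), 4, 4)])
```

The example series is built to follow the multiplicative model exactly, so with the true starting values the smoothed forecasts track the observations. With real data the forecasts lag behind changes to a degree controlled by w, t, and c.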


Forecasting Liquor Sales 

Let’s use exponential smoothing to predict future liquor sales. For the
purposes of demonstration, we’ll assume a multiplicative model. You will have
to decide on values for each of the three smoothing constants. The values
need not be the same. For example, seasonal adjustments often change more
slowly than the trend and location factors, so you might want to choose a
low value for the seasonal smoothing constant, say about 0.05. However,
if you feel that the trend factor or location factor will change more rapidly
over the course of time, you will want a higher value for the smoothing
constant, such as 0.15. As you have seen in this chapter, the values you choose
for these smoothing constants depend in part on your experience with the
data. Excel does not provide a feature to do smoothing with the Winters’
method. One has been provided for you with the Exponential Smoothing
command found in StatPlus.

To forecast future liquor sales: 

1 Return to the Liquor Sales worksheet.

2 Click Time Series from the StatPlus menu and then click Exponential
Smoothing.

3 Select Sales for your Data Values variable.

4 Enter 0.15 in the General Options Weight box. This is your value for w.

5 Click the Linear Trend option button, and enter 0.15 in the Linear
Weight box. This is your value for t.

6 Click the Multiplicative option button, and enter 0.05 in the Seasonal
Weight box. This is the value of c. Verify that the length of the period
is set to 12.

7 Click the Forecast checkbox and enter 12 in the Units ahead box.
This will forecast the next year’s liquor sales.

8 Verify that 0.95 is entered in the Confidence Interval box. This will
produce a 95% confidence region around your forecasted values.

9 Click the Output button and direct your output to a new sheet named
Forecasted Sales. Click OK.

Your dialog box should look like Figure 11-30. 


Figure 11-30 The Perform Exponential Smoothing dialog box

10 Click OK.

Output from the command appears on the Forecasted Sales
worksheet. To view the forecasted values, drag the vertical scroll bar
down. See Figure 11-31.

Figure 11-31 Forecasted sales values with 95% confidence region

The output does not give the month for each forecast, but you can easily
confirm that observation 145 in column A is January 2008 because
observation 144 on the Liquor Sales worksheet is December 2007. On the basis of
the values in column B, you forecast that in the next year, sales will reach a
peak in December (observation 156) with a sales figure of $5,093.5 million.
The 95% prediction interval for this estimate is about $4,932.9 million to
$5,254.2 million. In other words, in December you would expect sales of
not less than 4,932.9 million dollars or more than 5,254.2 million dollars.
You could use these estimates to plan your sales strategy for the
upcoming year. Before putting much faith in the prediction intervals, you should
verify the assumptions for the smoothed forecasts. If the smoothing model
is correct, the residuals should be independent (show no discernible ACF
pattern) and follow a normal distribution with mean 0. You would find for
the liquor sales that these assumptions are met.

Scrolling back up the worksheet, you can view how well the smoothing
method forecasted liquor sales in the previous years, as shown in
Figure 11-32.

Figure 11-32 Three-parameter exponential smoothing values
(callouts: sales values forecasted for the next year; descriptive
statistics and smoothing values)

The standard error of the forecast is 72.60194, indicating that the typical
forecasting error in the time series is about 73 units (the MAD value is lower
with a value of about 57.6 units).

The final estimates for the location and trend values are 3488.428 and
14.1268, respectively. The location value represents the monthly sales after
adjusting for seasonal effects. The trend estimate indicates that sales are
increasing at a rate of about $14.13 million per month, hardly a large increase
given the magnitude of the monthly sales.

The output also includes a scatterplot comparing the observed, smoothed,
and forecasted sales values. Because of the number of points in the
time series, the seasonal curves are close together and difficult to interpret.
To make it easier to view the comparison between the observed and
forecasted values, rescale the x axis to show only the current year and the
forecasted year’s values.

To rescale the x axis: 

1 Click the plot to select it. 

2 Click the Axes button in the Axes group of the Layout tab on the
Chart Tools ribbon and then click Primary Horizontal Axis and More
Primary Horizontal Axis Options.

3 Click the Fixed option button for the Minimum scale value and 
change it to 130. Click the Close button. 

The revised plot appears in Figure 11-33.

Figure 11-33 Plot of forecasted and observed sales for the current and
upcoming year

From the rescaled plot, you would conclude that exponential smoothing
has done a good job of modeling the sales data and that the resulting
forecasts appear reasonable.

The final part of the exponential smoothing output is the final estimate of
the seasonal indexes, shown in Figure 11-34.

Figure 11-34 Seasonal indices

The values for the seasonal indexes are very similar to those you
calculated using the seasonal adjustment command. The difference is due to the
fact that these seasonal indexes are calculated using a smoothed average,
whereas the earlier indexes were calculated using an unweighted average.

You’re finished with the workbook. You can close it now, saving your
changes.

Optimizing the Exponential Smoothing Constant (optional)

As you’ve seen in this chapter, the choice for the value of the exponential
smoothing constant depends partly on the analyst’s experience and
intuition. Many analysts advocate using the value that minimizes the mean
square error. You can use an Excel add-in called the Solver to calculate this
value. To demonstrate how this technique works, open the Exponential
workbook, which contains a set of time series data.

To open the Exponential workbook: 

1 Open the Exponential workbook from the Chapter11 data folder.

2 Save the file as Exponential Smoothing. 


The workbook displays the column of sample time series data. Let’s create a
column of exponentially smoothed forecasts. First we must decide on an initial
estimate for the smoothing constant w; we can start with any value we want, so
let’s start with 0.15. From this value, we’ll calculate the mean square error.

To calculate the mean square error:

1 Click cell F1, type 0.15, and then press Enter.

Next determine a value for S_0 to be put in cell C2. We’ll use the first
value in the time series.

2 Click cell C2, type =B2, and then press Enter.

Now create a column of smoothed forecasts S_n, using the recursive
smoothing equation.

3 Select the range C3:C120.

4 Type =$F$1*B2+(1-$F$1)*C2; then press Enter.

5 Click the Fill button from the Editing group on the Home tab and
click Down to fill the formula down the rest of the column.

Now create a column of squared errors [(forecast - observed)^2].

6 Select the range D2:D120.

7 Type =(C2-B2)^2, and then press Enter.

8 Fill the formula down the rest of the column.

Finally, calculate the mean square error for this particular value of w.

9 Click cell F2, type =SUM(D2:D120)/119, and then press Enter.


Figure 11-35 Exponential smoothing values

Verify that the values in your spreadsheet match the values in
Figure 11-35.
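
If it helps to see the same calculation outside the spreadsheet, the steps
above can be sketched in Python. This is only an illustrative translation of
the worksheet formulas: the function names and the four-value series are
made up, and the real workbook has 119 observations in B2:B120.

```python
def smoothed_forecasts(y, w):
    """Mirror column C: C2 is =B2, then =$F$1*B2+(1-$F$1)*C2 filled down."""
    s = [y[0]]                                   # S_0, the first observation
    for i in range(1, len(y)):
        s.append(w * y[i - 1] + (1 - w) * s[i - 1])
    return s


def mean_square_error(y, w):
    """Mirror cell F2: the sum of column D divided by the number of rows."""
    forecasts = smoothed_forecasts(y, w)
    squared_errors = [(f - obs) ** 2 for f, obs in zip(forecasts, y)]  # column D
    return sum(squared_errors) / len(y)


# Hypothetical data standing in for the workbook's time series.
series = [10.0, 12.0, 11.0, 13.0]
print(smoothed_forecasts(series, 0.5))
print(mean_square_error(series, 0.5))
```

As in the worksheet, the first squared error is always 0 because the first
forecast is set equal to the first observation.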



You now have everything you need to use Solver.

To open Solver:

1 Click the Office button and then click Excel Options.

2 Click Add-Ins from the list of Excel Options and then click Go next
to the Manage Excel Add-Ins list box.

3 Click the Solver Add-In check box if it is not already selected and
then click the OK button.


Once the Solver is installed and activated, you can determine the optimal
value for the smoothing constant.

To determine the optimal value for the smoothing constant:

1 Click the Solver button located on the Analysis group in the Data tab.

2 Type F2 in the Set Target Cell text box. This is the cell that you will
use as a target for the Solver.

3 Click the Min option button to indicate that you want to minimize
the value of the mean square error (cell F2).

4 Type F1 in the By Changing Cells text box to indicate that you want to
change the value of F1, the smoothing constant, in order to minimize
cell F2.

Because the exponential smoothing constant can take on only
values between 0 and 1, you have to add some constraints to the values
that the Solver will investigate.

5 Click the Add button.

6 Type F1 in the Cell Reference text box, select <= from the Constraint
drop-down list, type 1 in the Constraint text box, and then click Add.

7 Type F1 in the Cell Reference text box, select >= from the Constraint
drop-down list, type 0 in the Constraint text box, and then click Add.

8 Click Cancel to return to the Solver Parameters dialog box. The
completed Solver Parameters dialog box should look like Figure 11-36.

Figure 11-36 The Solver Parameters dialog box



9 Click Solve. 

The Solver now determines the optimal value for the smoothing
constant (at least in terms of minimizing the mean square error).
When the Solver is finished, it will prompt you either to keep the
Solver solution or to restore the original values.

10 Click OK to keep the solution.


The Solver returns a value of 0.028792 (cell F1) for the smoothing
constant, resulting in a mean square error of 23.99456 (cell F2). This is the
optimal value for the smoothing constant.
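
The same kind of constrained search can be sketched in Python. The chapter
does not describe the Solver's internal algorithm, so the sketch below
swaps in a plain golden-section search on the interval from 0 to 1, which
assumes the mean square error has a single minimum there. The function
names and the short series are hypothetical; the workbook's data are not
reproduced, so this will not return the 0.028792 found above.

```python
import math


def mean_square_error(y, w):
    """Mean square error of one-parameter smoothing, as in cell F2."""
    s = y[0]                          # S_0: start from the first observation
    total = 0.0
    for obs in y:
        total += (s - obs) ** 2       # squared (forecast - observed)
        s = w * obs + (1 - w) * s     # recursive smoothing equation
    return total / len(y)


def minimize_on_interval(f, lo=0.0, hi=1.0, tol=1e-6):
    """Golden-section search for the minimizer of f on [lo, hi]."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if f(c) < f(d):
            b = d                     # the minimum lies in [a, d]
        else:
            a = c                     # the minimum lies in [c, b]
    return (a + b) / 2


# Hypothetical series; replace it with the values from column B.
series = [23.0, 25.5, 24.1, 26.8, 22.9, 25.0, 24.4, 26.1]
best_w = minimize_on_interval(lambda w: mean_square_error(series, w))
print(best_w, mean_square_error(series, best_w))
```

Keeping the search inside [0, 1] plays the role of the two constraints you
added in the Solver Parameters dialog box.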
It’s possible to set up similar spreadsheets for two-parameter and
three-parameter exponential smoothing, but that will not be demonstrated here.
The main difficulty in setting up the spreadsheet to do these calculations is
in determining the initial estimates of S_0, T_0, and the seasonal indexes.

In the case of two-parameter exponential smoothing, you would use
linear regression on the entire time series to derive initial estimates for the
location and trend values. Once this is done, you would derive the
forecasted values using the recursive equations described earlier in the
chapter. You would then apply the Solver to minimize the mean square error
of the forecasts by modifying both the location and the trend smoothing
constants. Using the Solver to derive the best smoothing constants for the
three-parameter model is more complicated because you have to come
up with initial estimates for all of the seasonal indexes. The interested
student can refer to more advanced texts for techniques to calculate the
initial estimates.
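
For readers who want to experiment, here is a minimal Python sketch of the
two-parameter setup just described. It assumes the common Holt form of the
recursive equations, S_n = w*y_n + (1-w)*(S_(n-1) + T_(n-1)) and
T_n = t*(S_n - S_(n-1)) + (1-t)*T_(n-1); the chapter's own equations are not
reproduced in this passage and may differ in notation. The function names
and the sample series are hypothetical.

```python
def fit_line(y):
    """Least-squares intercept and slope of y against n = 1, 2, ..., N."""
    n = len(y)
    xs = range(1, n + 1)
    x_bar = sum(xs) / n
    y_bar = sum(y) / n
    sxy = sum((x - x_bar) * (v - y_bar) for x, v in zip(xs, y))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def two_parameter_mse(y, w, t):
    """Mean square error of the one-step-ahead forecasts S + T."""
    s, trend = fit_line(y)            # initial estimates S_0 and T_0
    total = 0.0
    for obs in y:
        forecast = s + trend          # forecast made before seeing obs
        total += (forecast - obs) ** 2
        new_s = w * obs + (1 - w) * forecast
        trend = t * (new_s - s) + (1 - t) * trend
        s = new_s
    return total / len(y)


# Hypothetical series with an upward trend.
series = [12.0, 14.5, 16.1, 19.2, 20.8, 23.5, 25.1, 27.9]
print(two_parameter_mse(series, 0.10, 0.20))
```

A function like two_parameter_mse could then be minimized over both w and t,
each kept between 0 and 1, in the same spirit as the Solver setup described
above.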

You can now save and close the Exponential Smoothing workbook.


Exercises 

1. Do the following calculations for one-parameter
exponential smoothing,
where w = 0.10:

a. S_4 = 23.4 and y_5 = 29. What is S_5?

b. If the observed value of y_6 is 25, what
is the value of S_6? Assume the same
values as in part a.

2. Do the following calculation for two-parameter
exponential smoothing,
where w = 0.10 and t = 0.20:

a. S_4 = 23.4, T_4 = 1.1, and y_5 = 29.
What is S_5? What is T_5?

b. If the observed value of y_6 is 25, what
are the values of S_6 and T_6? Assume
the same values as in part a.

3. If monthly sales are equal to 4,811 units
and the seasonal index for that month is
0.85, what is the adjusted sales figure?

4. How can you tell whether a series is
seasonal? Mention plots, including the
ACF. What is the difference between
additive and multiplicative seasonality?


5. A politician citing the latest raw monthly
unemployment figures claimed that
unemployment had fallen by 88,000
workers. The Bureau of Labor Statistics,
however, using seasonally adjusted
totals, claimed that unemployment had
increased by 98,000. Discuss the two
interpretations of the data. Which number
gives a better indication of the state of
the economy?

6. The Batting Average workbook contains
data on the leading major league
baseball batting averages for the years
1901 to 2002. Analyze these data.

a. Open the Batting Average workbook
from the Chapter11 folder and save it
as Batting Average Analysis.

b. Create a line chart of the batting
average versus year. Do you see any
apparent trends? Do you see any outliers?
Does George Brett’s average of 0.390
in 1980 stand out compared with
other observations?

c. Insert a trend line smoothing the
batting average using a ten-year
moving average.

d. Calculate the ACF and state your
conclusions (notice that the ACF does
not drop off to zero right away, which
suggests a trend component).

e. Calculate the difference of the batting
averages from one year to the next.
Plot the difference series and also 
compute its ACF. Does the plot show 
that the variance of the original series 
is reasonably stable? That is, are the 
changes roughly the same size at the 
beginning, middle, and end of the 
series? 

f. Looking at the ACF of the differenced 
series, do you see much correlation 
after the first few lags? If not, it sug¬ 
gests that the differenced series does 
not have a trend, and this is what you 
would expect. Interpret any lags that 
are significantly correlated. 

g. Perform one-parameter exponential 
smoothing forecasting one year ahead, 
using w values of 0.2, 0.3, 0.4, and 
0.5. In each case, notice the value 
predicted for 2003 (observation 103). 
Which parameter gives the lowest 
standard error? 

h. Save your changes to the workbook 
and write a report summarizing your 
observations. 

7. The Electric workbook has monthly data 
on U.S. electric power production, 1978 
through 1990. The variable called power 
is measured in billions of kilowatt 
hours. The figures come from the 1992 
CRB Commodity Year Book, published 
by the Commodity Research Bureau in 
New York. 

a. Open the Electric workbook from
the Chapter11 folder and save it as
Electric Analysis.

b. Create a line chart of the power data. 
Is there any seasonality to the data? 


c. Fit a three-parameter exponential 
model with location, linear, and sea¬ 
sonal parameters. Use a smoothing 
constant of 0.05 for the location 
parameter, 0.15 for the linear parameter, 
and 0.05 for the seasonal parameter. 
What level of power production do 
you forecast over the next 12 months? 

d. Using the seasonal index, which are 
the three months of highest power 
production? Is this in accordance 
with the plots you have seen? Does 
it make sense to you as a consumer? 
By what percentage does the busiest 
month exceed the slowest month? 

e. Repeat the exponential smoothing of
part c with the smoothing
constants shown in Table 11-7.


Table 11-7 Exponential Smoothing Constants

Location   Linear   Seasonal
0.05       0.30     0.05
0.15       0.15     0.05
0.15       0.30     0.05
0.30       0.15     0.05
0.30       0.30     0.05


f. Which forecasts give the smallest 
standard error? 

g. Save your changes to the workbook 
and report your observations. 

8. The Visit workbook contains monthly 
visitation data for two sites at the Kenai 
Fjords National Park in Alaska from 
January 1990 to June 1994. You’ll ana¬ 
lyze the visitation data for the Exit 
Glacier site. 

a. Open the Visit workbook from the
Chapter11 data folder and save it as
Visit Analysis.

b. Create a line plot of visitation for Exit 
Glacier versus year and month. Sum¬ 
marize the pattern of visitation at Exit 
Glacier between 1990 and mid-1994. 
c. Create two line plots, one showing
the visitation at Exit Glacier plotted 
against year with different lines for 
different months, and the second 
showing visitation plotted against 
month with different lines for differ¬ 
ent years (you will have to create a 
two-way table for this). Are there any 
unusual values? How might the June 
1994 data influence future visitation 
forecasts? 

d. Calculate the seasonally adjusted 
values for visits to the park. Is there 
a particular month in which visits to 
the park jump to a new and higher 
level? 

e. Smooth the visitation data using
exponential smoothing. Use smoothing
constants of 0.15 for both the location
and the linear parameters, and use
0.05 for the seasonal parameter.
Forecast the visitation 12 months into the
future. What are the projected values
for the next 12 months?

f. The projected visitations for
1994-1995 rely heavily on the
jump in visitation in June 1994.
Assume that this jump was an
aberration, and refit two exponential
smoothing models with 0.05 and 0.01
for the location parameter (to reduce
the effect of the June 1994 increase),
0.15 for the linear parameter, and 0.05
for the seasonal parameter. Compare
your results with your first forecasts.
How do the standard errors compare?
Which projections would you work
with and why? What further
information would you need to decide
between these three projections?

g. What problems do you see with either
forecasted value? (Hint: Look at the
confidence intervals for the forecasts.)

h. Save your changes to the workbook 
and write a report summarizing your 
observations. 


9. The visitation data in the Visit workbook
cover a wide range of values. It might
be appropriate to analyze the log10 of
the visitation counts instead of the raw
counts.

a. Open the Visit workbook from the
Chapter11 folder and save it as Visit
Log Analysis.

b. Create a new column in the workbook
of the log10 counts of the Exit Glacier
data (use the Excel function LOG10).

c. Create a line plot of log10(visitation)
for the Exit Glacier site from 1990 to
mid-1994. What seasonal values does
this chart reveal that were hidden
when you charted the raw counts?

d. Use exponential smoothing to smooth
the log10(visitation) data. Use a value
of 0.15 for the location and linear
effects, and use 0.05 for the seasonal
effect. Project log10(visitation) 12
months into the future. Untransform
the projections and the prediction
intervals by raising 10 to the power
of log10(visitation) (that is, if
log10(visitation) = 1.6, then visitation =
10^1.6 = 39.8). What do you project for
the next year at Exit Glacier? What are
the 95% prediction intervals? Are the
upper and lower limits reasonable?

e. Redo your forecasts, using 0.01 and
then 0.05 for the location parameter,
0.15 for the linear parameter, and 0.05
for the seasonal parameter. Which of
the three projections results in the
smallest standard error?

f. Compare your chosen projections
from Exercise 8, using the raw counts,
with your chosen projections from
this exercise, using the log10
transformed counts. Which would you use
to project the 1994-1995 visitations?
Which would you use to determine
the amount of personnel you will
need in the winter months and why?

g. Save your changes to the workbook
and write a report summarizing your
conclusions.

10. The NFP workbook contains daily body
temperature data for 239 consecutive
days for a woman in her twenties. Daily
temperature readings are one component
of natural family planning (NFP)
in which a woman uses her monthly
cycle with a number of biological signs
to determine the onset of ovulation.
The file has four columns: Observation,
Period (the menstrual period), Day (the
day of the menstrual period), and Waking
Temperature. Day 1 is the first day of
menstruation.

a. Open the NFP workbook from the
Chapter11 folder and save it as NFP
Analysis.

b. Create a line plot of the daily body
temperature values. Do you see any
evidence of seasonality in the data?

c. Create a boxplot of temperature
versus day. What can you determine
about the relationship between
body temperature and the onset of
menstruation?

d. Calculate the ACF for the temperature
data up through lag 70. On the basis
of the shape of the ACF, what would
you estimate as the length of the
period in days?

e. Smooth the data using exponential
smoothing. Use 0.15 as the location
parameter, 0.01 for the linear
parameter (it will not be important in this
model), and 0.05 for the seasonal
parameter. Use the period length that
you estimated in part d.
What body temperature values do you
forecast for the next cycle?

f. Repeat your forecast with values of
0.15 and 0.25 for the seasonal
parameters. Which model has the lowest
standard error?


g. Save your changes to the workbook
and write a report summarizing your
conclusions.

11. The Draft workbook contains data from
the 1970 Selective Service draft. Each
birth date was given a draft number.
Those eligible men with a low draft
number were drafted first. One way
of presenting the draft number data is
through exponential smoothing. The
draft numbers vary greatly from day to
day, but by smoothing the data, you may
be better able to spot trends in the draft
numbers. In this exercise, you’ll use
exponential smoothing to examine the
distribution of the draft numbers.

a. Open the Draft workbook from the
Chapter11 folder and save it as Draft
Number Analysis.

b. Create one-parameter exponential
smoothed plots of the number
variable on the Draft Numbers worksheet.
Use values of 0.15, 0.085, and 0.05 for
the location parameter. Which value
results in the lowest mean square
error?

c. Examine your plots. Does there
appear to be any sort of pattern in the
smoothed data?

d. Test to see whether any autocorrelation
exists in the draft numbers. Test
for autocorrelation up to a lag of 30. Is
there any evidence for autocorrelation
in the time series?

e. Save your changes to the workbook
and write a report summarizing your
observations.

12. The Oil workbook displays information
on monthly production of crude
cottonseed oil from 1992 to 1995. The
production of cottonseed oil follows a
seasonal pattern. Using the data in this
workbook, project the monthly values
for 1996.
a. Open the Oil workbook from the
Chapter11 folder and save it as Oil
Forecasts.

b. Restructure the data in the worksheet 
into a two-way table. Create a line 
plot of the production values in the 
table using a separate line for each 
year. Describe the seasonal nature of 
cottonseed oil production. 

c. Smooth the production data using a 
value of 0.15 for all three smoothing 
factors. Forecast the values 12 months 
into the future. What are your projec¬ 
tions and your upper and lower limits 
for 1996? 

d. Adjust the production data for the 
seasonal effects. Is there evidence that 
the adjusted production values have 
increased over the four-year period? 
Test your assumption by performing
a linear regression of the adjusted
values on the month number (1-48).
Is the regression significant at the
5% level?

e. Save your changes to the workbook 
and write a report summarizing your 
conclusions. 

13. The Bureau of Labor Statistics records 
the number of work stoppages each 
month that involve 1000 or more work¬ 
ers in the period. Are such work stop¬ 
pages seasonal in nature? Are there 
more work stoppages in summer than 
in winter? 

a. Open the Stoppage workbook from
the Chapter11 folder and save it as
Stoppage Analysis.

b. Restructure the data in the Work Stop¬ 
page worksheet into a two-way table, 
with each year in a separate row and 
each month in a separate column. 

c. Use the two-way table to create a
boxplot and line plot of the work
stoppage values. Which months have the
highest work stoppage numbers? Do
work stoppages occur more often in
winter or in summer?

d. Adjust the work stoppage values 
assuming a 12-month cycle. Is there 
evidence in the scatterplot that the 
adjusted number of work stoppages 
has decreased over the past decade? 

e. Smooth the adjusted values using
one-parameter exponential smoothing.
Use a value of 0.15 for the smoothing
parameter.

f. Save your changes to the workbook. 
Summarize your findings regarding 
work stoppages of 1000 or more work¬ 
ers. Are they seasonal? Have they de¬ 
clined in recent years? Use whatever 
charts and tables you created to sup¬ 
port your conclusions. 

14. The Jobs workbook contains monthly 
youth unemployment rates from 1981 to 
1996. Analyze the data in the workbook 
and try to determine whether unemploy¬ 
ment rates are seasonal. 

a. Open the Jobs workbook from the
Chapter11 folder. Save it as Jobs
Analysis.

b. Restructure the data in the Youth 
Unemployment worksheet into a two- 
way table, with each year in a separate 
row and each month in a separate 
column. 

c. Create a spaghetti plot of the unem¬ 
ployment values. 

d. Create a boxplot of youth unemploy¬ 
ment rates. Is any pattern apparent in 
the boxplot? 

e. Adjust the unemployment rates as¬ 
suming a 12-month cycle. Is there 
evidence in the chart that youth un¬ 
employment varies with the season? 

f. Save your changes to the workbook 
and write a report summarizing your 
observations. 
Chapter 12


Quality Control

Objectives

In this chapter you will learn to:

• Distinguish between controlled and uncontrolled variation
• Distinguish between variables and attributes
• Determine control limits for several types of control charts
• Use graphics to create statistical control charts with Excel
• Interpret control charts
• Create a Pareto chart

In this chapter you will look at one of the statistical tools used in
manufacturing and industry. The proper use of quality control can
improve productivity, enhance quality, and reduce production costs.
In this chapter, you’ll learn about one such tool, the control chart, that is
used to determine when a process is out of control and requires human
intervention.


Statistical Quality Control 

The immediately preceding chapters have been dedicated to the
identification of relationships and patterns among variables. Such relationships are
not immediately obvious, mainly because they are never exact for individual 
observations. There is always some sort of variation that obscures the true 
association. In some instances, once the relationship has been identified, an 
understanding of the types and sources of variation becomes critical. This 
is especially true in business, where people are interested in controlling the 
variation of a process. A process is any activity that takes a set of inputs and 
creates a product. The process for an industrial plant takes raw materials and 
creates a finished product. A process need not be industrial. For example, 
another type of process might be to take unorganized information and pro¬ 
duce an organized analysis. Teaching could even be considered a process, 
because the teacher takes uninformed students and produces students capa¬ 
ble of understanding a subject (such as statistics!). In all such processes, peo¬ 
ple are interested in controlling the procedure so as to improve the quality. 
The analysis of processes for this purpose is called statistical quality control 
(SQC) or statistical process control (SPC). 

Statistical process control originated in 1924 with Walter A. Shewhart, 
a researcher for Bell Telephone. A certain Bell product was being manufac¬ 
tured with great variation in quality, and the production managers could not 
seem to reduce the variation to an acceptable level. Dr. Shewhart developed 
the rudimentary tools of statistical process control to improve the homoge¬ 
neity of Bell’s output. Shewhart’s ideas were later championed and refined 
by W. Edwards Deming, who tried unsuccessfully to persuade U.S. firms to 
implement SPC as a methodology underlying all production processes. Hav¬ 
ing failed to convince U.S. executives of the merits of SPC, Deming took his 
cause to Japan, which, before World War II, was renowned for its shoddy 
goods. The Japanese adopted SPC wholeheartedly, and Japanese production 
became synonymous with high and uniform quality. In response, U.S. firms 
jumped on the SPC bandwagon, and many of their products regained mar¬ 
ket share. 
Controlled Variation


The reduction of variation in any process is beneficial. However, you can
never eliminate all variation, even in the simplest process, because there
are bound to be many small, unobservable, chance effects that influence
the process outcome. Variation of this kind is called controlled variation
and is analogous to the random-error effects in the ANOVA and
regression models you studied earlier. As in those statistical models, many
individually insignificant random factors interact to have some net effect on
the process output. In quality-control terminology, this random variation
is said to be “in control,” not because the process operator is able to
control the factors absolutely, but rather because the variation is the result of
normal disturbances, called common causes, within the process. This type
of variation can be predicted. In other words, given the limitations of the
process, each of these common causes is controlled to the greatest extent
possible.

Because controlled variation is the result of small variations in the
normally functioning process, it cannot be reduced unless the entire process
is redesigned. Furthermore, any attempts to reduce the controlled variation
without redesigning the process will create more, not less, variation in the
process. Endeavoring to reduce controlled variation is called tampering; this
increases costs and must be avoided. Tampering might occur, for instance,
when operators adjust machinery in response to normal variations in the
production process. Because normal variations will always occur,
adjusting the machine is more likely to harm the process, actually increasing the
variation in the process, than to help it.


Uncontrolled Variation

The other type of variation that can occur within a process is called
uncontrolled variation. Uncontrolled variation is due to special causes, which are
sources of variation that arise sporadically and for reasons outside the nor¬ 
mally functioning process. Variation induced by a special cause is usually 
significant in magnitude and occurs only occasionally. Examples of special 
causes include differences between machines, different skill or concentra¬ 
tion levels of workers, changes in atmospheric conditions, and variation in 
the quality of inputs. 

Unlike controlled variation, uncontrolled variation can be reduced by 
eliminating its special cause. The failure to bring uncontrolled variation 
into control is costly. 

SPC is a methodology for distinguishing whether variation is controlled 
or uncontrolled. If variation is controlled, then only improvements in the 
process itself can reduce it. If variation is uncontrolled, then further analy¬ 
sis is needed to identify and eliminate the special cause. 

Table 12-1 summarizes the two types of variation studied in SPC. 
Table 12-1 Types of Variation

Variation     Description                      Remedy
Controlled    Variation that is native to the  Redesign the process to result in
              process, resulting from normal   a new set of controlled variations
              factors called common causes     with better properties.
Uncontrolled  Variation that is the result of  Analyze the process to locate
              special causes and need not be   the source of the uncontrolled
              inherent in the process          variation and then remove or fix
                                               that special cause.


Control Charts 

The principal tool of SPC is the control chart. A control chart is a graph of the
process values plotted in time order. Figure 12-1 shows a sample control chart.

[Figure 12-1: sample control chart, with the upper control limit (UCL),
center line, and lower control limit (LCL) labeled]

The chief features of the control chart are the lower and upper control
limits (LCL and UCL, respectively), which appear as dotted horizontal lines.
The solid line between the upper and lower control limits is the center line
and indicates the expected values of the process.

As the process goes forward, values are added to the control chart. As
long as the points remain between the lower and upper control limits, we
assume that the observed variation is controlled variation and that the
process is in control (there are a few exceptions to this rule, which we’ll
discuss shortly). Figure 12-1 shows a process that is in control. It is important
to note that control limits do not represent specification limits or
maximum variation targets. Rather, control limits illustrate the limits of normal
controlled variation.



In contrast, the process depicted in Figure 12-2 is out of control. Both the
fourth and the twelfth observations lie outside of the control limits, leading
us to believe that their values are the result of uncontrolled variation. At
this point a shop manager, or the person responsible for the process, might
examine the conditions for those observations that resulted in such extreme
values. An analysis of the causes could lead to a better, more efficient, and
more stable process.



Even control charts in which all points lie between the control limits 
might suggest that a process is out of control. In particular, the existence of a 
pattern in eight or more consecutive points indicates a process out of control, 
because an obvious pattern violates the assumption of random variability. In 
Figure 12-3, for example, the last eight observations depict a steady upward 
trend. Even though all of the points lie within the control limits, you must 
conclude that this process is out of control because of the evident trend the 
data values exhibit. 


Figure 12-3 A process out of control because of an upward trend

Another common example of a process that is out of control, even though
all points lie between the control limits, appears in Figure 12-4. The first
eight observations are below the center line, whereas the last seven
observations all lie above the center line. Because of prolonged periods where
values are either small or large, this process is out of control. One could use
the Runs test, discussed in Chapter 8 in the context of examining residuals,
to test whether the data values are clustered in a nonrandom way.


Figure 12-4 A process out of control because of a nonrandom pattern

Here are two other situations that may show a process out of control, even
though all values lie within the control limits.

• 9 points in a row, all on the same side of the center line 

• 14 points in a row, alternating above and below the center line 

Other suspicious patterns could appear in control charts. Unfortunately,
we cannot discuss them all here. In general, though, any clear pattern in the
process values indicates that a process is subject to uncontrolled variation
and that it is not in control.

Statisticians usually highlight out-of-control points in control charts by
circling them. As you can see, the control chart makes it very easy for you to
visually identify points and processes that are out of control without using
complicated statistical tests. This makes the control chart an ideal tool for
the shop floor, where quick and easy methods are needed.
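
The visual checks described above can also be automated. The following is a minimal Python sketch, not part of StatPlus, that flags the four signals named in this section: a point outside the control limits, nine points in a row on one side of the center line, fourteen points in a row alternating above and below the center line, and eight consecutive points forming a steady trend. The function name, the rule labels, and the way a "steady trend" is coded (seven consecutive differences with the same sign) are assumptions made for illustration.

```python
def out_of_control_signals(values, center, lcl, ucl):
    """Return {rule name: list of 0-based indices at which the rule fires}.

    For run-based rules the index reported is the last point of the run.
    """
    signals = {
        "outside_limits": [],
        "nine_same_side": [],
        "fourteen_alternating": [],
        "eight_trend": [],
    }

    # Rule 1: any point beyond the lower or upper control limit.
    for i, v in enumerate(values):
        if v < lcl or v > ucl:
            signals["outside_limits"].append(i)

    # +1 if above the center line, -1 if below, 0 if exactly on it.
    sides = [(v > center) - (v < center) for v in values]

    # Rule 2: nine points in a row on the same side of the center line.
    for i in range(8, len(values)):
        window = sides[i - 8:i + 1]
        if all(s == 1 for s in window) or all(s == -1 for s in window):
            signals["nine_same_side"].append(i)

    # Rule 3: fourteen points in a row alternating above and below.
    for i in range(13, len(values)):
        window = sides[i - 13:i + 1]
        nonzero = all(s != 0 for s in window)
        flips = all(window[j] == -window[j + 1] for j in range(13))
        if nonzero and flips:
            signals["fourteen_alternating"].append(i)

    # Rule 4 (assumed coding): eight points steadily rising or falling,
    # that is, seven consecutive differences with the same sign.
    for i in range(7, len(values)):
        diffs = [values[j + 1] - values[j] for j in range(i - 7, i)]
        if all(d > 0 for d in diffs) or all(d < 0 for d in diffs):
            signals["eight_trend"].append(i)

    return signals
```

A chart would typically be judged out of control if any of the lists returned is nonempty; which extra pattern rules to apply is a choice that varies between quality-control references.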


Control Charts and Hypothesis Testing 

The idea underlying control charts should be familiar to you. It is closely
related to confidence intervals and hypothesis testing. The associated null
hypothesis is that the process is in control; you reject this null hypothesis if
any point lies outside the control limits or if any clear pattern appears in the
distribution of the process values. Another insight from this analogy is that
the possibility of making errors exists, just as errors can occur in standard
hypothesis testing. In other words, occasionally a point that lies outside the 
control limits does not have any special cause but occurs because of normal 
process variation. On the other hand, there could exist a special cause that 
is not big enough to move the point outside of the control limits. Statistical 
analysis can never be 100% certain. 


Variable and Attribute Charts 

There are two categories of control charts: those which monitor variables 
and those which monitor attributes. Variable charts display continuous 
measures, such as weight, diameter, thickness, purity, and temperature. As 
you have probably already noticed, much statistical analysis focuses on the 
mean values of such measures. In a process that is in control, you expect the 
mean output of the process to be stable over time. 

Attribute charts differ from variable charts in that they describe a feature 
of the process rather than a continuous variable such as a weight or volume. 
Attributes can be either discrete quantities, such as the number of defects in 
a sample, or proportions, such as the percentage of defects per lot. Accident 
and safety rates are also typical examples of attributes. 


Using Subgroups 

In order to compare process levels at various points in time, we usually 
group individual observations together into subgroups. The purpose of the 
subgroup is to create a set of observations in which the process is relatively 
stable with controlled variation. Thus the subgroup should represent a set of 
homogeneous conditions. For example, if we were measuring the results of a 
manufacturing process, we might create a subgroup consisting of values from 
the same machine closely spaced in time. Once we create the subgroups, we 
can calculate the subgroup averages and calculate the variance of the values. 
The variation of the process values within the subgroups is then used to
calculate the control limits for the entire set of process values. A control
chart might then answer the question: Do the averages between the subgroups
vary more than expected, given the variation within the subgroups?
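
As a rough illustration of the subgroup bookkeeping described above, the sketch below computes each subgroup's average and range, then the overall averages that later control-limit formulas rely on. The measurements are hypothetical and the function name is an assumption chosen for this example.

```python
from statistics import mean


def subgroup_summary(subgroups):
    """Return (averages, ranges), one entry per subgroup of observations."""
    averages = [mean(group) for group in subgroups]
    ranges = [max(group) - min(group) for group in subgroups]
    return averages, ranges


# Hypothetical data: three subgroups of four closely spaced measurements.
subgroups = [
    [10.2, 9.8, 10.1, 9.9],
    [10.4, 10.0, 10.3, 10.1],
    [9.7, 10.0, 9.9, 10.0],
]

averages, ranges = subgroup_summary(subgroups)
grand_mean = mean(averages)      # average of the subgroup averages
average_range = mean(ranges)     # average within-subgroup range

print("subgroup averages:", [round(a, 3) for a in averages])
print("subgroup ranges:  ", [round(r, 3) for r in ranges])
print("grand mean:", round(grand_mean, 3), "average range:", round(average_range, 3))
```

The within-subgroup ranges measure common-cause variation, while the spread of the subgroup averages is what the chart compares against it.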


The x̄ Chart

One of the most common variable control charts is the x̄ chart (the “x bar
chart”). Each point in the x̄ chart displays the subgroup average against the
subgroup number. Because observations usually are taken at regular time
intervals, the subgroup number is typically a variable that measures time,
with subgroup 2 occurring after subgroup 1 and before subgroup 3. As an
example, consider a clothing store in which the owner monitors the length
of time customers wait to be served. He decides to calculate the average wait
time in half-hour increments. The customers who were served between 9:00
and 9:30 a.m. form the first subgroup, and the owner records the average
wait time during this interval. The second subgroup covers customers served
from 9:30 to 10:00 a.m., and so forth.

The x̄ chart is based on the standard normal distribution. The standard
normal distribution underlies the mean chart, because the Central Limit
Theorem (see Chapter 5) states that the subgroup averages approximately
follow the normal distribution even when the underlying observations are
not normally distributed.

The applicability of the normal distribution allows the control limits
to be calculated very easily when the standard deviation of the process is
known. You might recall from Chapter 5 that about 99.7% of the observations
in a normal distribution fall within 3 standard deviations of the mean (μ).
In SPC, this means that points that fall more than 3 standard deviations
from the mean occur only about 0.3% of the time. Because this probability
is so small, points outside the control limits are assumed to be the result
of uncontrolled special causes. Why not narrow the control limits to ±2
standard deviations? The problem with this approach is that you would
increase the false-alarm rate, that is, the number of times you stop a
process that you incorrectly believed was out of control. Stopping a process
can be expensive, and adjusting a process that doesn’t need adjusting might
increase the variability through tampering. For this reason, a
3-standard-deviation control limit was chosen as a balance between running
an out-of-control process and incorrectly stopping a process when it
doesn’t need to be stopped.
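
To make the trade-off between 2-standard-deviation and 3-standard-deviation limits concrete, the following sketch computes the probability that a single in-control point falls outside limits of a given width. It assumes the plotted values are exactly normally distributed; the function name is hypothetical.

```python
import math


def false_alarm_rate(k):
    """P(|Z| > k) for a standard normal Z.

    This is the chance that one point from an in-control process lands
    outside control limits placed k standard deviations from the mean.
    """
    return math.erfc(k / math.sqrt(2))


for k in (2, 3):
    rate = false_alarm_rate(k)
    print(f"limits at ±{k} standard deviations: {rate:.4f} per point")
```

Under this assumption the per-point false-alarm probability is roughly 0.0455 with ±2 limits and roughly 0.0027 with ±3 limits, so narrowing the limits would produce false alarms about 17 times as often.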

You might also recall that the statistical tests you learned earlier in the
book differed slightly depending on whether the population standard
deviation was known or unknown. An analogous situation occurs with control
charts. The two possibilities are considered in the following sections.


Calculating Control Limits When σ Is Known

If the true standard deviation of the process (σ) is known, then the control
limits are

$$\mathrm{LCL} = \mu - \frac{3\sigma}{\sqrt{n}}$$

$$\mathrm{UCL} = \mu + \frac{3\sigma}{\sqrt{n}}$$

and about 99.7% of the points should lie between the control limits if the
process is in control. If σ is known, it usually derives from historical
values. Here, n is the number of observations in the subgroup. Note that in
this control chart and the charts that follow, n need not be the same for
all subgroups. Control charts are easier to interpret when n is the same
for every subgroup, though.

The value for μ might also be known from past values. Alternatively, μ
might represent the target mean of the process rather than the actual mean
attained. In practice, though, μ might also be unknown. In that case, the
mean of all of the subgroup averages, x̿ (“x double bar”), replaces μ as
follows:

$$\mathrm{LCL} = \bar{\bar{x}} - \frac{3\sigma}{\sqrt{n}}$$

$$\mathrm{UCL} = \bar{\bar{x}} + \frac{3\sigma}{\sqrt{n}}$$

The interpretation of the mean chart is the same whether the true process 
mean is known or unknown. 

Here is an example to help you understand the basic mean chart. Stu¬ 
dents are often concerned about getting into courses with “good” profes¬ 
sors and staying out of courses taught by “bad” ones. In order to provide 
students with information about the quality of instruction provided by 
different instructors, many universities use end-of-semester surveys in 
which students rate various professors on a numeric scale. At some schools, 
such results are even posted and used by students to help them decide in 
which section of a course to enroll. Many faculty members object to such 
rankings on the grounds that although there is always some apparent varia¬ 
tion among faculty members, there are seldom any significant differences. 
However, students often believe that variations in scores reflect the profes¬ 
sors’ relative aptitudes for teaching and are not simply random variations 
due to chance effects. 


x̄ Chart Example: Teaching Scores

One way to shed some light on the value of student evaluations of teaching 
is to examine the scores for one instructor over time. The Teacher work¬ 
book provides data ratings of one professor who has taught principles of 
economics at the same university for 20 consecutive semesters. The in¬ 
struction in this course can be considered a process, because the instructor 
has used the same teaching methods and covered the same material over 
the entire period. Five student evaluation scores were recorded for each 
of the 20 courses. The five scores for each semester constitute a subgroup. 
Possible teacher scores run from 0 (terrible) to 100 (outstanding). The range
names have been defined in Table 12-2 for the workbook.

Table 12-2 The Teacher Workbook

Range Name   Range    Description
Semester     A2:A21   The semester of the evaluation
Score_1      B2:B21   First student evaluation
Score_2      C2:C21   Second student evaluation
Score_3      D2:D21   Third student evaluation
Score_4      E2:E21   Fourth student evaluation
Score_5      F2:F21   Fifth student evaluation


To open the Teacher workbook: 

1 Open the Teacher workbook from the Chapter12 data folder.

2 Save the file as Teacher Control Chart.

Figure 12-5 displays the content of the workbook.


Figure 12-5 The Teacher workbook



There is obviously some variation between scores across semesters, with 
scores varying from a low of 54.0 to a high of 100. Without further analy¬ 
sis, you and your friends might think that such a spread indicates that the 
professor’s classroom performance has fluctuated widely over the course of 
20 semesters. Is this interpretation valid?

If you consider teaching to be a process, with student evaluation scores as
one of its products, you can use SPC to determine whether the process is in
control. In other words, you can use SPC techniques to determine whether
the variation in scores is due to identifiable differences in the quality of
instruction that can be attributed to a particular semester’s course (that is,
special causes) or is due merely to chance (common causes).

Historical data from other sources show that σ for this professor is 5.0.
Because there are five observations in each subgroup, n = 5. You can use
StatPlus to calculate the mean scores for each semester and then the average
of all 20 mean scores.

To create a control chart of the teacher’s scores: 

1 Click QC Charts from the StatPlus menu and then click XBar Chart.

2 Click the Subgroups in rows across columns option button.

3 Click the Data Values button and select the range names Score_1
through Score_5. Click OK.

4 Click the Sigma Known checkbox and type 5 in the accompanying
text box.

5 Click the Output button and send the control chart to a new chart
sheet named XBar Chart. Click OK.

Figure 12-6 shows the completed dialog box.


Figure 12-6 The Create an XBAR Control Chart dialog box


6 Click OK.


Figure 12-7 Control chart of teacher scores (callout: values are in control)


As you can see from Figure 12-7, no mean score falls outside the con¬ 
trol limits. The lower control limit is 77.462, the mean subgroup average is 
84.17, and the upper control limit is 90.878. There is no evident trend to the 
data or nonrandom pattern. You conclude that there is no reason to believe 
the teaching process is out of control. 
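
The limits reported for the teacher scores can be checked by hand from the known-σ formulas. The sketch below plugs in the values stated in the text (a mean subgroup average of 84.17, σ = 5.0, and n = 5); the function name is an assumption, and this is only an arithmetic check, not a description of how StatPlus computes the chart.

```python
import math


def xbar_limits_known_sigma(center, sigma, n):
    """Lower and upper control limits for subgroup means when sigma is known."""
    half_width = 3 * sigma / math.sqrt(n)
    return center - half_width, center + half_width


# Values quoted in the text for the Teacher workbook.
lcl, ucl = xbar_limits_known_sigma(center=84.17, sigma=5.0, n=5)
print(round(lcl, 3), round(ucl, 3))
```

The half-width is 3 × 5 / √5, about 6.708, which reproduces the limits of 77.462 and 90.878 shown on the chart.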

Because we conclude that the process is in control, in contrast to what 
the typical student might conclude from the data, there is no evidence that 
this professor’s performance was better or worse in one semester than in 
another. The raw scores from the last three semesters are misleading. A
student might claim that using a historical value for σ is also misleading,
because a smaller value for σ could lead one to conclude that the scores
were not in control after all. The exercises at the end of this chapter will
examine this issue by redoing the control chart with an unknown value for σ.

One corollary to the preceding analysis should be stated: Because even 
one professor experiences wide fluctuations in student evaluations over 
time, apparent differences among various faculty members can also be 
deceptive. You should use all such statistics with caution. 

You can close the Teacher Control Chart workbook now, saving your changes. 

Calculating Control Limits When σ Is Unknown

In many instances, the value of σ is not known. You learned in Chapter 6
that the normal distribution does not strictly apply for analysis when σ is
unknown and must be estimated. In that chapter, the t distribution was used
instead of the standard normal distribution. Because SPC is often
implemented on the shop floor by workers who have had little or no formal
statistical training (and might not have ready access to Excel), the method
for estimating σ is simplified and the normal approximation is used to
construct the control chart. The difference is that when σ is unknown, the
control limits are estimated using the average range of observations within
a subgroup as the measure of the variability of the process. The control
limits are

$$\mathrm{LCL} = \bar{\bar{x}} - A_2 \bar{R}$$

$$\mathrm{UCL} = \bar{\bar{x}} + A_2 \bar{R}$$

R̄ represents the average of the subgroup ranges, and x̿ is the average of
the subgroup averages. A2 is a correction factor that is used in
quality-control charts. As you’ll see, there are many correction factors for
different types of control charts. Table 12-3 displays a list of common
correction factors for various subgroup sizes n.


Text not available due to copyright restrictions

A2 accounts for both the factor of 3 from the earlier equations (used when σ
was known) and for the fact that the average range represents a proxy for the
common-cause variation. (There are other alternative methods for calculating
control limits when σ is unknown.) As you can see from the table, A2 depends
only on the number of observations in each subgroup. Furthermore, the control
limits become tighter when the subgroup sample size increases. The most
typical sample size is 5 because this usually ensures normality of sample
means. You will learn to use the control factors in the table later in the
chapter.
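
One way to see how A2 combines the factor of 3 with the range-based estimate of σ is the standard relationship A2 = 3 / (d2 √n), where R̄ / d2 serves as the estimate of σ. The sketch below evaluates it for two subgroup sizes. The d2 values used here (1.128 for n = 2 and 2.326 for n = 5) are the commonly published constants rounded to three decimals, and the dictionary and function names are assumptions for this example.

```python
import math

# Commonly published d2 constants (rounded), keyed by subgroup size n.
D2_CONSTANT = {2: 1.128, 5: 2.326}


def a2_factor(n):
    """A2 = 3 / (d2 * sqrt(n)), so that A2 * R-bar = 3 * (R-bar / d2) / sqrt(n)."""
    return 3 / (D2_CONSTANT[n] * math.sqrt(n))


for n in sorted(D2_CONSTANT):
    print(f"n = {n}: A2 is approximately {a2_factor(n):.4f}")
```

The results agree with the usual tabulated values of 1.880 for n = 2 and 0.577 for n = 5 to within the rounding of d2, and they show A2 shrinking as the subgroup size grows, which is why the limits tighten for larger subgroups.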


x̄ Chart Example: A Coating Process

The data in the Coats workbook come from a manufacturing firm that
sprays one of its metal products with a special coating to prevent corrosion.
Because this company has just begun to implement SPC, σ is unknown for
the coating process.

To open the Coats workbook: 

1 Open the Coats workbook from the Chapter12 data folder.

2 Save the workbook as Coats Control Chart.

Figure 12-8 shows the contents of the workbook.


Figure 12-8 The Coats workbook

The weight of the spray in milligrams is recorded, with two observations
taken at each of 28 times each day. Note that the data are arranged
differently from the Teacher workbook, with the Time column indicating the
subgroup number. The range names have been defined for the workbook in
Table 12-4.

Table 12-4 The Coats Workbook

Range Name   Range    Description
Time         A2:A57   The order of the evaluation (also the subgroup number)
Weight       B2:B57   The weight of the spray in milligrams


As before, you can use StatPlus to create the control chart. Note that
because n = 2 (there are two observations per subgroup), A2 = 1.880.

To create a control chart of the weight values:

1 Click QC Charts from the StatPlus menu and click XBar Chart.

2 Click the Data Values button and select the Weight range name.
Click OK.

3 Click the Subgroups button and select Time from the range names
list. Click OK.

4 Click the Output button and send the control chart to a new chart
sheet named XBar Chart. Click OK twice. See Figure 12-9.

The lower control limit is 128.336, the average of the subgroup averages is
134.446, and the upper control limit is 140.556. Note that although most of
the points in the mean chart lie between the control limits, four points
(observations 1, 9, 17, and 20) lie outside the limits. This process is not
in control.
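
These limits can be reproduced from the unknown-σ formulas. The sketch below uses the grand mean of 134.446 and A2 = 1.880 from the text, together with the average subgroup range of 3.25 that the text reports for the range chart of the same data. The function name is an assumption, and the code is an arithmetic check rather than a description of StatPlus internals.

```python
def xbar_limits_from_ranges(grand_mean, average_range, a2):
    """Control limits for subgroup means when sigma is estimated from ranges."""
    half_width = a2 * average_range
    return grand_mean - half_width, grand_mean + half_width


# Values quoted in the text for the Coats workbook (n = 2, so A2 = 1.880).
lcl, ucl = xbar_limits_from_ranges(grand_mean=134.446, average_range=3.25, a2=1.880)
print(round(lcl, 3), round(ucl, 3))
```

The half-width is 1.880 × 3.25 = 6.11, which gives the limits of 128.336 and 140.556 quoted above.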

Because the process is out of control, you should attempt to identify the 
special causes associated with each out-of-control point. Observation 1, for 
example, has too much coating. Perhaps the coating mechanism became 
stuck for an instant while applying the spray to that item. The other three 
observations indicate too little coating on the associated products. In talking 
with the operator, you might learn that he had not added coating material to 
the sprayer on schedule, so there was insufficient material to spray. 

It is common practice in SPC to note the special causes either on the front 
of the control chart (if there is room) or on the back. This is a convenient 
way of keeping records of special causes. 

In many instances, proper investigation leads to identification of the spe¬ 
cial causes underlying out-of-control processes. However, there might be 
out-of-control points whose special causes cannot be identified. 


The Range Chart 

The x̄ chart provides information about the variation around the average
value for each subgroup. It is also important to know whether the range
of values is stable from group to group. In the coating example, if some
observations exhibit very large ranges and others very small ranges, you
might conclude that the sprayer is not functioning consistently over time.
To test this, you can create a control chart of the subgroup ranges, called
a range chart. As with the x̄ chart, the width of the control limits depends
on the variability within each subgroup. If σ is known, the control limits
for the range chart are

$$\mathrm{LCL} = D_1 \sigma$$

$$\text{Center line} = d_2 \sigma$$

$$\mathrm{UCL} = D_2 \sigma$$

and if σ is not known, the control limits are

$$\mathrm{LCL} = D_3 \bar{R}$$

$$\mathrm{UCL} = D_4 \bar{R}$$

where d2, D1, D2, D3, and D4 are the correction factors from Table 12-3, and
R̄ is the average subgroup range. It’s important to note that the x̄ chart is
valid only when the range is in control. For this reason the range chart is
usually drawn alongside the x̄ chart.

Use the information in the Coats workbook to determine whether the
range of coating weights is in control.

To create a range chart of the weight values:

1 Return to the Coating Data worksheet.

2 Click QC Charts from the StatPlus menu and click Range Chart.

3 Select Weight as your Data Values variable and Time as the
Subgroup variable.

4 Verify that the Sigma Known checkbox is unselected.

5 Direct the output to a new chart sheet named Range Chart. Click
OK twice.



Figure 12-10 callout: value is not within control limits


Each point on the range chart represents the range within each subgroup.
The average subgroup range is 3.25, with the control limits going from 0 to
10.62. According to the range chart shown in Figure 12-10, only the 27th
observation has an out-of-control value. The special cause should be
identified if possible. However, in discussing the problem with the operator,
sometimes you might not be able to determine a special cause. This does not
necessarily mean that no special cause exists; it could mean instead that you
are unable to determine what the cause is in this instance. It is also
possible that there really is no special cause. However, because you are
constructing control charts with the width of about 3 standard deviations,
an out-of-control value is unlikely unless there is something wrong with the
process.

You might have noticed that the range chart identifies as out of control a
point that was apparently in control in the x̄ chart but does not identify any
of the four observations that are out of control in the x̄ chart. This is a
common occurrence. For this reason, the x̄ chart and range charts are often
used in conjunction to determine whether a process is in control. In practice,
the x̄ chart and range chart often appear on the same page because viewing
both charts simultaneously improves the overall picture of the process. In
this example, you would judge that the process is out of control with both
charts but on the basis of different observations.
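
The range chart limits reported for the coating data can be checked against the unknown-σ formulas for the range chart. Because Table 12-3 is not reproduced here, the sketch assumes the commonly published factors for subgroups of size 2, D3 = 0 and D4 = 3.267; the function name is also an assumption.

```python
def range_chart_limits(average_range, d3, d4):
    """Lower and upper control limits for subgroup ranges (sigma unknown)."""
    return d3 * average_range, d4 * average_range


# Average subgroup range from the text; D3 and D4 are the usual
# published constants for n = 2 (assumed, since the table is unavailable).
lcl, ucl = range_chart_limits(average_range=3.25, d3=0.0, d4=3.267)
print(round(lcl, 2), round(ucl, 2))
```

With these factors the limits come out as 0 and about 10.62, matching the values described for the range chart.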

You can close the Coats Control Chart workbook now, saving your
changes.


The C Chart 

Both the x̄ chart and the range chart measure the values of a particular
variable. Now let’s look at an attribute chart that measures an attribute of
the process. Some processes can be described by counting a certain feature,
such as the number of flaws in a standardized section of continuous sheet
metal or the number of defects in a production lot. The number of accidents
in a plant might also be counted in this manner. A C chart displays control
limits for the counts attribute. The lower and upper control limits are

$$\mathrm{LCL} = \bar{c} - 3\sqrt{\bar{c}}$$

$$\mathrm{UCL} = \bar{c} + 3\sqrt{\bar{c}}$$

where c̄ is the average number of counts in each subgroup. If the LCL is
less than zero, by convention it will be set to equal zero, because a negative
count is impossible.


C Chart Example: Factory Accidents 


The Accidents workbook contains the number of accidents that occurred
each month during a period of a few years at a production site. Let’s create
control charts of the number of accidents per month to determine whether
the process is in control.

To open the Accidents workbook:

1 Open the Accidents workbook from the Chapter12 folder.

2 Save the workbook as Accidents Control Chart. See Figure 12-11.


Figure 12-11 The Accidents workbook

The range names have been defined for the workbook in Table 12-5.


Table 12-5 The Accidents Workbook

Range Name   Range    Description
Month        A2:A45   The month
Accidents    B2:B45   The number of accidents that month


To create a C chart for accidents at this firm:

1 Click QC Charts from the StatPlus menu and then click C Chart.

2 Select Accidents as the Data Values variable.

3 Direct the output to a new chart sheet named C Chart.

4 Click OK. Excel generates the chart shown in Figure 12-12.


Figure 12-12 C chart of the number of accidents per month (callout: number
of accidents per month exceeded control limits)



Each point on the C chart in Figure 12-12 represents the number of 
accidents per month. The average number of accidents per month was 7.11. 
Only in the seventh month did the number of accidents exceed the upper 
control limit of 15.12 with 16 accidents. Since then, the process appears 
to have been in control. Of course, it is appropriate to determine the spe¬ 
cial causes associated with the large number of accidents in the seventh 
month. In the case of this firm, the workload was particularly heavy during 
that month, and a substantial amount of overtime was required. Because 
employees put in longer shifts than they were accustomed to working, fa¬ 
tigue is likely to have been the source of the extra accidents. 
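
The C chart limits follow directly from the average count. The sketch below applies the C chart formulas to the average of 7.11 accidents per month quoted above, setting a negative lower limit to zero as the text describes. The function name is an assumption.

```python
import math


def c_chart_limits(c_bar):
    """Control limits for counts; a negative lower limit is set to zero."""
    half_width = 3 * math.sqrt(c_bar)
    return max(0.0, c_bar - half_width), c_bar + half_width


lcl, ucl = c_chart_limits(7.11)
print(round(lcl, 2), round(ucl, 2))
```

Here 3√7.11 is about 8.0, so the lower limit is clipped to 0 and the upper limit comes out at about 15.11. The small gap from the charted value of 15.12 is presumably because 7.11 is itself a rounded average.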

You can close the Accidents Control Chart workbook, saving your results. 


The P Chart 

Closely related to the C chart is the P chart, which depicts the proportion of 
items with a particular attribute, such as defects. The P chart is often used to 
analyze the proportion of defects in each subgroup. 

Let p̄ denote the average proportion of the sample that is defective. The
distribution of the proportions can be approximated by the normal
distribution, provided that np̄ and n(1 − p̄) are both at least 5. If p̄ is
very close to 0 or 1, a very large subgroup size might be required for the
approximation to be legitimate.

The lower and upper control limits are

$$\mathrm{LCL} = \bar{p} - 3\sqrt{\frac{\bar{p}(1 - \bar{p})}{n}}$$

$$\mathrm{UCL} = \bar{p} + 3\sqrt{\frac{\bar{p}(1 - \bar{p})}{n}}$$


P Chart Example: Steel Rod Defects 


A manufacturer of steel rods regularly tests whether the rods will withstand
50% more pressure than the company claims them to be capable of
withstanding. A rod that fails this test is defective. Twenty samples of 200
rods each were obtained over a period of time, and the number and fraction
of defects were recorded in the Steel workbook.


To open the Steel workbook:


1 Open the Steel workbook from the Chapter12 data folder.

2 Save your workbook as Steel Control Chart. See Figure 12-13.


Figure 12-13 The Steel workbook

The range names have been defined for the workbook in Table 12-6.


Table 12-6 The Steel Workbook

Range Name   Range    Description
Subgroup     A2:A21   The subgroup number
N            B2:B21   The size of the subgroup
Defects      C2:C21   The number of defects in the subgroup
Percentage   D2:D21   The fraction of defects in the subgroup


To create a P chart for the percentage of steel rod defects:

1 Click QC Charts from the StatPlus menu and click P Chart.

2 Click the Proportions button and select Percentage from the list of
range names. Click OK.

3 Type 200 in the Sample Size box, because each subgroup has the
same sample size.

4 Send the output to a new chart sheet named P Chart.

5 Click OK.

Excel generates the P chart shown in Figure 12-14.

As shown in Figure 12-14, the lower control limit is 0.01069, or a defect
percentage of about 1%. The upper control limit is 0.11281, or about 11%.
The average defect percentage is 0.06175, about 6%. The control chart clearly
demonstrates that no point is anywhere near the 3-σ limits.
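
The P chart limits can be verified from the average proportion and the subgroup size. The sketch below uses the values in the text (an average proportion of 0.06175 and 200 rods per sample). The function name is an assumption, and clipping the limits to the interval from 0 to 1 is an added safeguard that the text does not discuss.

```python
import math


def p_chart_limits(p_bar, n):
    """Control limits for a proportion based on subgroups of size n."""
    half_width = 3 * math.sqrt(p_bar * (1 - p_bar) / n)
    # Assumption: keep the limits inside [0, 1], since a proportion
    # cannot fall outside that interval.
    return max(0.0, p_bar - half_width), min(1.0, p_bar + half_width)


p_bar, n = 0.06175, 200
lcl, ucl = p_chart_limits(p_bar, n)
print(round(lcl, 5), round(ucl, 5))

# Rule of thumb from the text for using the normal approximation.
normal_ok = n * p_bar >= 5 and n * (1 - p_bar) >= 5
print("normal approximation reasonable:", normal_ok)
```

The half-width works out to about 0.05106, which reproduces the limits of 0.01069 and 0.11281. With n × p̄ = 12.35, the condition for the normal approximation is also met.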

Note that not all out-of-control points indicate the existence of a problem.
For example, suppose that another sample of 200 rods was taken and that
only one rod failed the stress test. In other words, only one-half of 1% of the
sample was defective. In this case, the proportion is 0.005, which falls
below the lower control limit, so technically it is out of control. Yet you
would not be concerned about the process being out of control in this case,
because the proportion of defects is so low. Still, you might be inclined to
investigate, just to see whether you could locate the source of your good
fortune and then duplicate it!

You can save and close the Steel Control Chart workbook now.


Control Charts for Individual Observations 

Up to now, we’ve been creating control charts for processes that can be
neatly divided into subgroups. Sometimes it’s not possible to group the data
into subgroups. This could occur when each measurement represents a single
batch in a process or when the measurements are widely spaced in time.
With a subgroup size of 1, it’s not possible to calculate subgroup ranges.
This makes many of the regular formulas impractical to apply.

Instead, the recommended method is to create a subgroup from each pair of
consecutive observations and then calculate the moving average of the
data. Thus the subgroup variation is determined by the variation from one
observation to another, and that variation will be used to determine the
control limits for the variation between subgroups. Because we are setting
up our subgroups differently, the formulas for the lower and upper control
limits are different as well. The LCL and UCL are

$$\mathrm{LCL} = \bar{x} - 3\frac{\bar{R}}{d_2}$$

$$\mathrm{UCL} = \bar{x} + 3\frac{\bar{R}}{d_2}$$

Here x̄ is the sample average of all of the observations, R̄ is the average
range of consecutive values in the data set, and d2 is the control limit
factor shown earlier in Table 12-3. We are using a moving average of size 2,
so d2 will be equal to 1.128. Control charts based on these limits are called
individuals charts.

We can also create a moving range chart of the moving range values, that
is, the range between consecutive values. In this case, the lower and upper
control limits match the ones used earlier for the range chart.

$$\mathrm{LCL} = D_3 \bar{R}$$

$$\mathrm{UCL} = D_4 \bar{R}$$
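
The individuals and moving range limits above can be sketched in a few lines of Python. The data values below are hypothetical, not taken from the Strength workbook. The constant d2 = 1.128 comes from the text; D3 = 0 and D4 = 3.267 are the commonly published factors for ranges of size 2 and are assumed here because Table 12-3 is not reproduced. The function and constant names are also assumptions.

```python
from statistics import mean

D2_N2 = 1.128   # d2 for moving ranges of size 2 (from the text)
D3_N2 = 0.0     # assumed standard factor for n = 2
D4_N2 = 3.267   # assumed standard factor for n = 2


def individuals_and_mr_limits(values):
    """Return (LCL, center, UCL) for the individuals and moving range charts."""
    # Moving range: absolute difference between consecutive observations.
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    x_bar = mean(values)
    r_bar = mean(moving_ranges)
    half_width = 3 * r_bar / D2_N2
    return {
        "individuals": (x_bar - half_width, x_bar, x_bar + half_width),
        "moving_range": (D3_N2 * r_bar, r_bar, D4_N2 * r_bar),
    }


# Hypothetical tensile-strength readings, for illustration only.
strengths = [50.0, 52.0, 51.0, 53.0, 52.0]
limits = individuals_and_mr_limits(strengths)
print("individuals chart: ", [round(v, 3) for v in limits["individuals"]])
print("moving range chart:", [round(v, 3) for v in limits["moving_range"]])
```

Note that n observations yield only n − 1 moving ranges, so the moving range chart has one fewer point than the individuals chart.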

Let’s apply these formulas to a workbook recording the tensile strength of 
25 steel samples. The values are stored in the Strength workbook. 

To open the Strength workbook: 

1 Open the Strength workbook from the Chapter12 data folder.

2 Save your workbook as Strength Control Chart. See Figure 12-15.


Figure 12-15 The Strength workbook

The range names are shown in Table 12-7.

Table 12-7 The Strength Workbook

Range Name   Range    Description
Obs          A2:A26   The observation number
Strength     B2:B26   The tensile strength of the sample, measured to the
                      nearest 500 pounds in 1,000-pound units


To create an Individuals chart for the steel samples:

1 Click QC Charts from the StatPlus menu and click Individuals
Chart.

2 Select Strength for the Data Values variable.

3 Send the output to a new chart sheet named I-Chart.

4 Click OK. Excel generates the I chart shown in Figure 12-16.


Figure 12-16 The individuals chart for tensile strength samples (callout:
upward trend may indicate a process that is not in control)

The chart shown in Figure 12-16 gives the values of the individual
observations (not the moving averages) plotted alongside the upper and
lower control limits. No values fall outside the control limits; this leads us
to conclude that the process is in control. However, the last eight
observations are all either above or near the center line; this might indicate
a process going out of control toward the end of the process. This is
something that should be investigated further.

We should also plot the moving range chart, to see whether there is any
evidence in that plot of an out-of-control process.

To create a moving range chart for the steel samples:

1 Click QC Charts from the StatPlus menu and then click Moving
Range Chart.

2 Select Strength for the Data Values variable.

3 Send the output to a new chart sheet named MR-Chart.

4 Click OK. Excel generates the chart shown in Figure 12-17.


Figure 12-17 callout: trend in the moving range chart indicates a process
not in control


The chart in Figure 12-17 shows additional indications of a process that
is not in control. The last seven values all fall below the center line, and
there appears to be a generally downward trend to the ranges from the sixth
observation on. We would conclude that there is sufficient evidence to
warrant further investigation and analysis.

You can save and close the Strength Control Chart workbook now.


The Pareto Chart


After you have determined that your process is resulting in an unusual 
number of problems, such as defects or accidents, the next natural step is 
to determine what component in the process is causing the problems. This 
investigation can be aided by a Pareto chart, which creates a bar chart of the 
causes of the problem in order from most to least frequent so that you can 
focus attention on the most important elements. The chart also includes the 
cumulative percentage of these components so that you can determine what 
combination of factors causes a certain percentage of the problems. 
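
The ordering and cumulative percentages that a Pareto chart displays can be tabulated with a short sketch. The defect log below is hypothetical (it is not the Powder data), and the function name is an assumption made for illustration.

```python
from collections import Counter


def pareto_table(causes):
    """Return rows of (cause, count, cumulative percent), most frequent first."""
    counts = Counter(causes).most_common()   # sorted by descending count
    total = sum(count for _, count in counts)
    table, running = [], 0
    for cause, count in counts:
        running += count
        table.append((cause, count, 100.0 * running / total))
    return table


# Hypothetical log: one entry per defective fill, labeled by cause.
defects = ["Head_04"] * 5 + ["Head_11"] * 3 + ["Head_02"] * 2

for cause, count, cumulative in pareto_table(defects):
    print(f"{cause}: {count} defects, {cumulative:.0f}% cumulative")
```

In this made-up log the two most frequent causes account for 80% of the defects, which is the kind of concentration a Pareto chart is meant to reveal.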

The Powder workbook contains data from a company that manufactures 
baby powder. Part of the process involves a machine called a filler, which 
pours the powder into bottles to a specified limit. The quantity of powder 
placed in the bottle varies because of uncontrolled variation, but the final 
weight of the bottle filled with powder cannot be less than 368.6 grams. 
Any bottle weighing less than this amount is rejected and must be refilled 
manually (at a considerable cost in terms of time and labor). Bottles are 
filled from a filler that has 24 valve heads so that 24 bottles can be filled at 
one time. Sometimes a head is clogged with powder, and this causes the 
bottles being filled on that head to receive less than the minimum amount 
of powder. To gauge whether the machine is operating within limits, you 
select random samples of 24 bottles (one from each head) at about one- 
minute intervals over the nighttime shift at the factory. You’ve been asked 
to examine the data and determine which part of the filler is most
responsible for defective fills.

To open the Powder workbook: 

1 Open the Powder workbook from the Chapter12 data folder.

2 Save the workbook as Powder Pareto Chart. See Figure 12-18.

Figure 12-18 The Powder workbook

The following range names have been defined for the workbook in
Figure 12-18:


Table 12-8 The Powder Workbook

Range Name    Range      Description
Time          A2:A352    The time of the sample
Head_01       B2:B352    Quantity of powder from head 1
Head_02       C2:C352    Quantity of powder from head 2
...           ...        ...
Head_24       Y2:Y352    Quantity of powder from head 24


Now generate the Pareto chart using StatPlus.

To create the Pareto chart:

1 Click QC Charts from the StatPlus menu and then click Pareto Chart.

2 Click the Values in separate columns option button.

3 Click the Data Values button and then select the range names from
Head_01 to Head_24 in the range names list (do not select the Time
variable). Click OK.

4 Click the Data values represent drop-down list box and select Error
occurs with a value less than.

5 Type 368.6 in the text box below the drop-down list box.

6 Click the Output button and direct the output to a new chart sheet
named Pareto Chart. Click OK.

Figure 12-19 shows the completed dialog box. 


Figure 12-19 The Create a Pareto Chart dialog box



7 Click OK. 

Excel generates the chart shown in Figure 12-20. 
[Figure 12-20 callouts: more of the defects come from filler head 18 than ...; plot of cumulative ...]



The Pareto chart displayed in Figure 12-20 shows that a large share of
the rejects come from a few heads. Filler head 18 accounts for 87 of the
defects, and the first three heads in the chart (18, 14, and 23) account for
almost 40% of all of the defects. There might be something physically
wrong with the heads that makes them more liable to clogging up with
powder. If rejects were being produced randomly from the filler heads,
you would expect that each filler head would produce 1/24, or about 4%,
of the total rejects, so any three heads together would produce only about
12.5%. Using the information from the Pareto chart shown in
Figure 12-20, you might want to repair or replace those three heads in
order to reduce clogging.

You can close the Powder Pareto Chart workbook now, saving your 
changes. 
Exercises


1. True or false, and why? The purpose of
statistical process control is to eliminate
all variation from a process.

2. True or false, and why? As long as the
process values lie between the control
limits, the process is in control.

3. Calculate the control limits for an
x̄ chart where

a. n = 9, μ = 50, and σ = 5.

b. n = 9, x̄ = 50, R̄ = 8, and σ is
unknown.

4. Calculate the control limits for a range
chart where

a. n = 4, R̄ = 10, and σ = 4. (What is
the value of the center line?)

b. n = 4, R̄ = 10, and σ is unknown.

5. Calculate the control limits for a C chart
where

a. c = 16.

b. c = 22.

6. Calculate the control limits for a P chart
where

a. n = 25 and p = 0.5.

b. n = 25 and p = 0.2.

7. Return to the Teacher workbook from
this chapter and perform the following
analysis:

a. Open the Teacher workbook from
the Chapter12 folder and save it as
Teacher Control Chart 2.

b. Redo the x̄ chart; this time do not
assume a value for σ.

c. Create a range chart of the data; once
again, do not assume a value for σ.

d. Examine your control charts. Is there
evidence that the teacher’s grades are
not in control? Report your conclusions
and save your changes to the workbook.


8. A can manufacturing company must be
careful to keep the width of its cans
consistent. One associated problem is that the
metalworking tools tend to wear down
during the day. To compensate, the
pressure behind the tools is increased as the
blades become worn. In the Cans
workbook, the width of 39 cans is measured at
four randomly selected points. Perform
the following analysis on the data:

a. Open the Cans workbook from the
Chapter12 folder and save it as Cans
Control Chart.

b. Use range and x̄ charts to determine
whether the process is in control.
Does the pressure-compensation
scheme seem to correct properly for
the tool wear? If not, suggest some
special causes that seem still to be
present in the process.

c. Report your results, saving your
changes to the workbook.

9. Just-in-time inventory management is
an important tool in project
management. The OnTime workbook contains
data regarding the proportion of on-time
deliveries during each month over a
two-year period for each of several
paperboard products (cartons, sheets,
and total). The Total column includes
cartons, sheets, and other products.
Because sheets were not produced for
the entire two-year period, a few data
points are missing for that variable.
Assume that 1,044 deliveries occurred
during each month. Perform the
following analysis on the data:

a. Open the OnTime workbook from the
Chapter12 folder, saving it as OnTime
Control Chart.

b. For each of these products (Cartons,
Sheets, and Total), use a P chart to
determine whether the delivery
process is in control. If not, suggest some
special causes that might exist.

c. Report your results and save your
changes to the workbook.

10. A steel sheet manufacturer is concerned 
about the number of defects, such as 
scratches and dents, that occur as the 
sheet is made. In order to track defects, 
10-foot lengths of sheet are examined at 
regular intervals. For each length, the 
number of defects is counted. Analyze 
these data to determine whether the
process is in control.

a. Open the Sheet workbook from the
Chapter12 folder and save the file as
Sheet Control Chart.

b. Determine whether the process is in
control. If it is not, suggest some
special causes that might exist.

c. Save your changes to the workbook
and write a report summarizing your
conclusions.

11. A firm is concerned about safety in its 
workplace. This company does not
consider all accidents to be identical. Instead,
it calculates a safety index, which
assigns more importance to more serious
accidents. Examine the data from their
study and perform the following analysis:

a. Open the Safety workbook from the
Chapter12 folder and save it as Safety
Control Chart.

b. Construct a C chart for the data to
determine whether safety is in control at
this firm.

c. Save your changes to the workbook 
and report your conclusions. 

12. A manufacturer subjects its steel bars 
to stress tests to be sure they are up to 
standard. Three bars were tested in each 
of 23 subgroups. The amount of stress 
applied before the bar breaks is recorded 
by the manufacturer. 


a. Open the Stress workbook from the
Chapter12 folder and save it as Stress
Control Chart.

b. Create a range chart and an x̄ chart to
determine whether the production
process is in control. If it is not, what
factors might be contributing to the
lack of control?

c. Report your results, saving your 
changes to the workbook. 

13. A steel rod manufacturer has contracted 
to supply rods 180 millimeters in length 
to one of its customers. Because the
cutting process varies somewhat, not all
rods are exactly the desired length. Five
rods were measured from each of 33
subgroups during a week. Analyze these
data to determine whether the process is
in control.

a. Open the Rod workbook from the
Chapter12 folder and save it as
Rod Control Chart.

b. Create range and x̄ charts to
determine whether the cutting process is in
statistical control.

c. Save your changes to the workbook 
and write a report summarizing your 
conclusions. 

14. An amusement park sampled customers 
leaving the park over an 18-day period. 
The total number of customers and the 
number of customers who indicated 
they were satisfied with their experience 
in the park were recorded in an Excel 
workbook. You've been asked to analyze 
these data to determine whether the 
percentage of satisfied customers is in 
statistical control. 

a. Open the Satisfy workbook from the
Chapter12 folder and save it as Satisfy
Control Chart.

b. Create the appropriate control chart for
the percentage of satisfied customers.
Is there any indication that the process
is out of control? What factors, if any,
might have contributed to this?

c. Save your changes to the workbook
and write a report of your
observations and conclusions.

15. The number of flaws on the surfaces of
a particular model of automobile
leaving the plant was recorded in the Autos
workbook for each of 40 automobiles
during a one-week period.

a. Open the Autos workbook from the
Chapter12 folder and save it as Autos
Control Chart.

b. Create a control chart of the count of
auto flaws. Is this process in control?

c. Save your changes to the workbook
and report your results.

16. You’ve learned in this chapter that filler
head 18 is a major factor in the number
of defective fills. To investigate further,
you decide to look at the head 18 values
from the data set to determine at what
points in time the head was out of
statistical control.

a. Open the Powder workbook from the
Chapter12 folder and save it as Powder
Control Chart.

b. Create an Individuals chart and a
moving range chart of the Head 18
values. At what times are the head
values beyond the control limits?

c. Repeat part b for filler heads 14 and 23.

d. Interpret your findings in light of the
fact that a new shift comes in at
midnight. Does this fact affect the filler
process?

e. Save your changes to the workbook
and write a report summarizing your
observations.

17. Weather can be considered a process
with process variables such as
temperature and precipitation and attribute
variables such as the number of
hurricanes and tornadoes in a given
season. One theory of meteorology
holds that climatic changes in this
process take place over long periods
of time, whereas over short periods of
time, the process should be stable. On
the other hand, concerns have been
raised about the effect of CO2
emissions on the atmosphere, which may
lead to major changes in the weather.
You’ve been given the yearly
temperature values for northern Illinois
from 1895 to 1998, saved in an Excel
workbook.

a. Open the Temp100 workbook from
the Chapter12 folder and save it as
Temp100 Control Chart.

b. Create an Individuals chart and a
moving range chart of the average
yearly temperature.

c. What is the average yearly
temperature? What are the lower and upper
control limits? Do the temperature
values appear to be in statistical
control?

d. Create a moving range chart of the
average yearly temperature. Does this
chart show any violations of process
control?

e. Save your changes to the workbook
and write a report summarizing your
results.

18. The Rain100 workbook contains the
total precipitation for northern Illinois
from 1895 to 1998.

a. Open the Rain100 workbook from
the Chapter12 folder and save it as
Rain100 Control Chart.

b. Create an Individuals chart of the
total precipitation.

c. Create a moving range chart of the
total precipitation.

d. Does the process appear to be in
statistical control? Save your workbook
and report your conclusions.

19. The Tornado workbook records the
number of tornadoes of various levels
of severity in Kansas from 1950 to
1999. Tornadoes are rated on the Fujita
Tornado Scale, which ranges from minor
tornadoes rated at F0 to major tornadoes
rated at F5. You’ve been asked to
determine whether the number of tornadoes
has changed over this period of time.

a. Open the Tornado workbook from
the Chapter12 folder and save it as
Tornado Control Chart.

b. Create a C chart of the number of
tornadoes each year for each
severity level and then for all types of
tornadoes.

c. Which classes of tornadoes show
signs of being out of statistical
control? Describe the problem.

d. Techniques in recording and counting 
tornadoes have improved in the last 
few decades, especially for minor 
tornadoes. Explain how this fact may 
be related to the results you noted in 
part c. 

e. Save your changes to the workbook 
and report your results. 

The Excel Reference contains the following:

• Excel’s Math and Statistical Functions
• StatPlus™ Commands
• StatPlus™ Math and Statistical Functions
• Bibliography

Excel’s Data Analysis ToolPak


The Analysis ToolPak add-ins that come with Excel enable you to perform
basic statistical analysis. None of the output from the Analysis ToolPak is
updated for changing data, so if the source data change, you will have to
rerun the command. To use the Analysis ToolPak, you must first verify that
it is available to your workbook.

To check whether the Analysis ToolPak is available:

1 Click the Office button and click Excel Options.

2 Click Add-Ins from the list of Excel Options and then click the Go
button next to the Manage Excel Add-Ins list box.

3 Select the Analysis ToolPak checkbox from the Add-Ins dialog box
to activate the Analysis ToolPak add-in and click the OK button.

4 Verify that the add-in is activated by clicking the Data tab and
verifying that the Data Analysis button appears in the Analysis
group.


The rest of this section documents each Analysis ToolPak command,
showing each corresponding dialog box and describing the features of the
command.


Output Options

All the dialog boxes that produce output share the following output storage
options:


Output Range

Click to send output to a cell in the current worksheet, and then type the
cell; Excel uses that cell as the upper left corner of the range.


New Worksheet Ply

Click to send output to a new worksheet; then type the name of the
worksheet.

New Workbook


Click to send output to a new workbook. 


Anova: Single Factor 

The Anova: Single Factor command calculates the one-way analysis of
variance, testing whether means from several samples are equal.



Input Range 

Enter the range of worksheet data you want to analyze. The range must be 
contiguous. 

Grouped By 

Indicate whether the range of samples is grouped by columns or by rows. 

Labels in First Row/Column 

Indicate whether the first row (or column) includes header information. 

Alpha 


Enter the alpha level used to determine the critical value for the F statistic. 

See “Output Options” at the beginning of this section for information on 
the output storage options. 
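
To make the one-way analysis of variance concrete, the following Python sketch computes the F statistic from the between-group and within-group sums of squares. The function name and the sample groups are hypothetical; the sketch is an illustration of the standard calculation and does not reproduce the ToolPak's full output table. The critical value and p-value require the F distribution and are omitted here.

```python
def one_way_anova(groups):
    """Return (f_statistic, df_between, df_within) for a list of samples."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between = k - 1
    df_within = n_total - k
    f_statistic = (ss_between / df_between) / (ss_within / df_within)
    return f_statistic, df_between, df_within


# Hypothetical samples, one list per group.
print(one_way_anova([[1, 2, 3], [2, 3, 4], [3, 4, 5]]))
```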

Anova: Two-Factor With Replication


The Anova: Two-Factor With Replication command calculates the two-way
analysis of variance with multiple observations for each combination of the
two factors. An analysis of variance table is created that tests for the
significance of the two factors and the significance of an interaction between the
two factors.



Input Range

Enter the range of worksheet data you want to analyze. The range must be
rectangular, the columns representing the first factor and the rows
representing the second factor. An equal number of rows are required for each
level of the second factor.


Rows per Sample 

Enter the number of repeated values for each combination of the two factors. 


Alpha 


Enter the alpha level used to determine the critical value for the F statistic. 

See “Output Options” at the beginning of this section for information on 
the output storage options. 

Anova: Two-Factor Without Replication


The Anova: Two-Factor Without Replication command calculates the two-way
analysis of variance with one observation for each combination of the
two factors. An analysis of variance table is created that tests for the
significance of the two factors.



Input Range 

Enter the range of worksheet data you want to analyze. The range must be 
contiguous, with each row and column representing a combination of the 
two factors. 


Labels 

Indicate whether the first row (or column) includes header information. 

Alpha 


Enter the alpha level used to determine the critical value for the F statistic. 

See “Output Options” at the beginning of this section for information on 
the output storage options. 

Correlation


The Correlation command creates a table of the Pearson correlation
coefficient for values in rows or columns on the worksheet.



Input Range 

Enter the range of worksheet data you want to analyze. The range must be 
contiguous. 


Grouped By 

Indicate whether the range of samples is grouped by columns or by rows. 

Labels in First Row/Column 

Indicate whether the first row (or column) includes header information.

See “Output Options” at the beginning of this section for information on 
the output storage options. 



Covariance 

The Covariance command creates a table of the covariance for values in
rows or columns on the worksheet.



Input Range 

Enter the range of worksheet data you want to analyze. The range must be 
contiguous. 


Grouped By 

Indicate whether the range of samples is grouped by columns or by rows. 

Labels in First Row/Column 

Indicate whether the first row (or column) includes header information. 

See “Output Options” at the beginning of this section for information on 
the output storage options. 
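
The quantities tabulated by the Correlation and Covariance commands can be sketched for a single pair of variables as follows. The function names and data are hypothetical. This sketch divides by n (a population covariance), which is an assumption about the convention in use; divide by n - 1 instead if the sample covariance is wanted. The correlation is the same under either convention.

```python
from math import sqrt


def covariance(x, y):
    """Population covariance of two equal-length lists (divides by n)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n


def correlation(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(x, y) / sqrt(covariance(x, x) * covariance(y, y))


# Hypothetical paired values.
x = [1, 2, 3]
y = [2, 4, 6]
print(covariance(x, y), correlation(x, y))
```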


Descriptive Statistics 

The Descriptive Statistics command creates a table of univariate descriptive 
statistics for values in rows or columns on the worksheet. 

Input Range

Enter the range of worksheet data you want to analyze. The range must be
contiguous.


Grouped By

Indicate whether the range of samples is grouped by columns or by rows.

Labels in First Row/Column

Indicate whether the first row (or column) includes header information.


Confidence Level for Mean

Click to print the specified confidence level for the mean in each row or
column of the input range.


Kth Largest

Click to print the kth largest value for each row or column of the input range;
enter the value for k in the corresponding box.

Kth Smallest


Click to print the kth smallest value for each row or column of the input
range; enter the value for k in the corresponding box.


Summary Statistics 

Click to print the following statistics in the output range: Mean, Standard 
Error (of the mean), Median, Mode, Standard Deviation, Variance, Kurtosis, 
Skewness, Range, Minimum, Maximum, Sum, Count, Largest (#), Smallest 
(#), and Confidence Level. 

See “Output Options” at the beginning of this section for information on 
the output storage options. 


Exponential Smoothing 

The Exponential Smoothing command creates a column of smoothed
averages using simple one-parameter exponential smoothing.



Input Range

Enter the range of worksheet data you want to analyze. The range must be a
single row or a single column.

Damping Factor

Enter the value of the smoothing constant. The value 0.3 is used as a default
if nothing is entered.

Labels


Indicate whether the first row (or column) includes header information.


Chart Output 

Click to create a chart of observed and forecasted values. 


Standard Errors 

Click to create a column of standard errors to the right of the forecasted 
column. 


Output options 

You can send output from this command only to a cell on the current 
worksheet. 
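
A minimal Python sketch of one-parameter exponential smoothing follows. It assumes that the damping factor d is the weight placed on the previous forecast, so that each new forecast is (1 - d) times the latest observation plus d times the previous forecast, and that the first forecast is simply the first observation. Both choices, along with the function name and data, are assumptions made for illustration and may not match the command's output cell for cell.

```python
def exponential_smoothing(data, damping=0.3):
    """Return one-step-ahead forecasts; the first entry has no forecast (None)."""
    if not 0 <= damping <= 1:
        raise ValueError("damping must be between 0 and 1")
    if not data:
        return []
    forecasts = [None]  # nothing to forecast from before the first observation
    for t in range(1, len(data)):
        if forecasts[-1] is None:
            # Assumption: start the recursion from the first observation.
            forecasts.append(float(data[0]))
        else:
            # Weight the latest observation by (1 - d) and the old forecast by d.
            forecasts.append((1 - damping) * data[t - 1] + damping * forecasts[-1])
    return forecasts


# Hypothetical series.
print(exponential_smoothing([10, 20, 30], damping=0.3))
```

A damping factor near 1 makes the forecasts respond slowly to new observations; a value near 0 makes them track the most recent value closely.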


F-Test: Two-Sample for Variances 

The F-Test: Two-Sample for Variances command performs an F test to
determine whether the population variances of two samples are equal.



Variable 1 Range

Enter the range of the first sample, either a single row or a single column.

Variable 2 Range


Enter the range of the second sample, either a single row or a single column.


Labels

Indicate whether the first row (or column) includes header information.

Alpha


Enter the alpha level used to determine the critical value for the F-statistic.

See “Output Options” at the beginning of this section for information on
the output storage options.


Histogram 

The Histogram command creates a frequency table for data values located
in a row, column, or list. The frequency table can be based on default or
customized bin widths. Additional output options include calculating the
cumulative percentage, creating a histogram, and creating a histogram sorted
in descending order of frequency (also known as a Pareto chart).



Input Range

Enter the range of worksheet data you want to analyze. The range must be a
row, column, or rectangular region.

Bin Range


Enter an optional range of values that defines the boundaries of the bins.


Labels


Indicate whether the first row (or column) includes header information.


Pareto (sorted histogram)

Click to create a Pareto chart sorted by descending order of frequency.


Cumulative Percentage

Click to calculate the cumulative percentages.


Chart Output

Click to create a histogram of frequency versus bin values.

See “Output Options” at the beginning of this section for information on
the output storage options.
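
The frequency table with cumulative percentages can be sketched in Python as shown below. The sketch assumes that each value is counted in the first bin whose boundary is greater than or equal to the value, with a final "More" row for values above the last boundary; that convention, the function name, and the data are assumptions for illustration.

```python
from bisect import bisect_left


def histogram_table(data, bins):
    """Return rows of (bin_label, frequency, cumulative_percent)."""
    if not data:
        raise ValueError("no data values")
    edges = sorted(bins)
    counts = [0] * (len(edges) + 1)  # one extra slot for the "More" row
    for x in data:
        # Index of the first boundary that is >= x; len(edges) means "More".
        counts[bisect_left(edges, x)] += 1
    labels = [str(edge) for edge in edges] + ["More"]
    rows = []
    running = 0
    for label, count in zip(labels, counts):
        running += count
        rows.append((label, count, 100.0 * running / len(data)))
    return rows


# Hypothetical data and bin boundaries.
for row in histogram_table([1, 2, 2, 3, 5, 8], [2, 4, 6]):
    print(row)
```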


Moving Average 

The Moving Average command creates a column of moving averages over
the preceding observations for an interval specified by the user.

Input Range


Enter the range of worksheet data for which you want to calculate the
moving average. The range must be a single row or a single column containing
four or more cells of data.


Labels in First Row

Indicate whether the first row (or column) includes header information.

Interval


Enter the number of cells you want to include in the moving average. The
default value is three.


Chart Output

Click to create a chart of observed and forecasted values.


Standard Errors

Click to create a column of standard errors to the right of the forecasted
column.


Output options 

You can only send output from this command to a cell on the current 
worksheet. 
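
A trailing moving average of the kind described here can be sketched in Python as follows. Each output value averages the current observation and the ones immediately before it; positions with fewer than a full interval of observations have no value (None in this sketch). The function name and data are hypothetical.

```python
def moving_average(data, interval=3):
    """Return the trailing moving average; early positions are None."""
    if interval < 1:
        raise ValueError("interval must be at least 1")
    result = []
    for i in range(len(data)):
        if i + 1 < interval:
            # Not enough preceding observations to fill the interval yet.
            result.append(None)
        else:
            window = data[i - interval + 1 : i + 1]
            result.append(sum(window) / interval)
    return result


# Hypothetical series with the default interval of three.
print(moving_average([1, 2, 3, 4, 5]))
```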


Random Number Generation 

The Random Number Generation command creates columns of random 
numbers following a user-specified distribution. 

Number of Variables

Enter the number of columns of random variables you want to generate. If
no value is entered, Excel fills up all available columns.

Number of Random Numbers

Enter the number of rows in each column of random variables you want to
generate. If no value is entered, Excel fills up all available rows. This
option is not available for the patterned distribution (see below).

Distribution

Click the down arrow to open a list of seven distributions from which you
can choose to generate random numbers and then specify the parameters of
that distribution.

Random Seed

Enter an optional value used as a starting point, called a random seed, for
generating a string of random numbers. You need not enter a random seed, but
using the same random seed ensures that the same string of random numbers will
be generated. This box is not available for patterned or discrete random data.

See “Output Options” at the beginning of this section for information on 
the output storage options. 

Rank and Percentile

The Rank and Percentile command produces a table with ordinal and
percentile values for each cell in the input range.



Input Range 

Enter the range of worksheet data you want to analyze. The range must be 
contiguous. 


Grouped By 

Indicate whether the range of samples is grouped by columns or by rows. 

Labels in First Row/Column 

Indicate whether the first row (or column) includes header information. 

See “Output Options” at the beginning of this section for information on 
the output storage options. 


Regression 

The Regression command performs multiple linear regression for a variable 
in an input column based on up to 16 predictor variables. The user has the 
option of calculating residuals and standardized residuals and producing 
line fit plots, residuals plots, and normal probability plots. 

Input Y Range

Enter a single column of values that will be the response variable in the
linear regression.


Input X Range

Enter up to 16 contiguous columns of values that will be the predictor
variables in the regression.


Labels


Indicate whether the first row of the Y range and that of the X range include
header information.


Constant is Zero

Click to assume that the intercept term is zero; leave the box unchecked to
include an intercept term in the linear regression.

Confidence Level


Click to indicate a confidence interval for linear regression parameter
estimates. A 95% confidence interval is automatically included; enter a different
one in the corresponding box.


Residuals


Click to create a column of residual (observed − predicted) values.


Residual Plots 

Click to create a plot of residuals versus each of the predictor variables in 
the model. 


Standardized Residuals 

Click to create a column of residuals divided by the standard error of the 
regression’s analysis of variance table. 


Line Fit Plots 

Click to create a plot of observed and predicted values against each of the 
predictor variables. 

Normal Probability Plots 

Click to create a normal probability plot of the Y variable in the Input Y 
Range. 

See “Output Options” at the beginning of this section for information on 
the output storage options. 
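
For a single predictor, the fitted line and the residuals can be sketched in Python as shown below, including the effect of forcing the intercept to zero. The function name, the `constant_is_zero` parameter, and the data are hypothetical; the Regression command itself handles multiple predictors and produces far more output than this illustration.

```python
def simple_regression(x, y, constant_is_zero=False):
    """Least-squares fit of y on one predictor x.

    Returns (intercept, slope, residuals), where each residual is
    observed minus predicted.
    """
    n = len(x)
    if constant_is_zero:
        # Line through the origin: slope = sum(xy) / sum(x^2).
        slope = sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)
        intercept = 0.0
    else:
        mean_x = sum(x) / n
        mean_y = sum(y) / n
        sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
        sxx = sum((a - mean_x) ** 2 for a in x)
        slope = sxy / sxx
        intercept = mean_y - slope * mean_x
    residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]
    return intercept, slope, residuals


# Hypothetical data.
print(simple_regression([1, 2, 3], [2, 4, 6.5]))
```

When an intercept is included, the residuals sum to zero; that is generally not the case when the intercept is forced to zero.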


Sampling 

The Sampling command creates a sample of an input range. The sample can 
be either random or periodic (sampling values a fixed number of cells apart). 
The sample generated is placed into a single column. 

Input Range

Enter the range of worksheet data you want to sample. The range must be
contiguous.


Labels

Indicate whether the first row (or column) of the input range includes
header information.

Sampling Method

Click the sampling method you want.

Periodic

Click to sample values from the input range period cells apart; enter a value
for period in the corresponding box.

Random


Click to create a random sample the size of which you enter in the
corresponding box.

See “Output Options” at the beginning of this section for information on 
the output storage options. 
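
The two sampling methods can be sketched in Python as follows. The periodic version assumes that sampling starts at the cell that is `period` positions into the range, and the random version assumes sampling with replacement; both are assumptions about the command's behavior, and the function names and data are hypothetical.

```python
import random


def periodic_sample(values, period):
    """Take every period-th value: positions period, 2*period, and so on."""
    if period < 1:
        raise ValueError("period must be at least 1")
    return values[period - 1 :: period]


def random_sample(values, size, seed=None):
    """Draw `size` values at random, with replacement (an assumption)."""
    rng = random.Random(seed)
    return [rng.choice(values) for _ in range(size)]


# Hypothetical input range holding the values 1 through 10.
cells = list(range(1, 11))
print(periodic_sample(cells, 3))
print(random_sample(cells, 5, seed=1))
```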

t-Test: Paired Two Sample for Means

The t-Test: Paired Two Sample for Means command calculates the paired
two-sample Student’s t-test. The output includes both the one-tail and the
two-tail critical values.



Variable 1 Range

Enter the input range of the first sample; it must be a single row or column.


Variable 2 Range

Enter the input range of the second sample; it must be a single row or column.


Hypothesized Mean Difference

Enter a mean difference value with which to calculate the t-test. If no value
is entered, a mean difference of zero is assumed.


Labels


Indicate whether the first row of the Variable 1 range and that of the
Variable 2 range include header information.

Alpha


Enter an alpha value used to calculate the critical values of the t shown in
the output.

See “Output Options” at the beginning of this section for information on 
the output storage options. 


t-Test: Two-Sample Assuming Equal Variances 

The t-Test: Two-Sample Assuming Equal Variances command calculates the
unpaired two-sample Student’s t-test. The test assumes that the variances
in the two groups are equal. The output includes both the one-tail and the
two-tail critical values.



Variable 1 Range

Enter an input range for the first sample; it must be a single row or column.


Variable 2 Range

Enter an input range for the second sample; it must be a single row or column.


Hypothesized Mean Difference

Enter a mean difference with which to calculate the t-test. If no value is
entered, a mean difference of zero is assumed.

Labels


Indicate whether the first row of the Variable 1 range and that of the
Variable 2 range include header information.


Alpha


Enter an alpha value used to calculate the critical values of the t shown in
the output.

See “Output Options” at the beginning of this section for information on
the output storage options.


t-Test: Two-Sample Assuming Unequal Variances

The t-Test: Two-Sample Assuming Unequal Variances command calculates
the unpaired two-sample Student’s t-test. The test allows the variances in
the two groups to be unequal. The output includes both the one-tail and the
two-tail critical values.



Variable 1 Range

Enter the input range of the first sample; it must be a single row or column.

Variable 2 Range

Enter the input range of the second sample; it must be a single row or column.

Hypothesized Mean Difference


Enter a mean difference with which to calculate the t-test. If no value is
entered, a mean difference of zero is assumed.


Labels

Indicate whether the first row of the Variable 1 range and that of the
Variable 2 range include header information.

Alpha


Enter an alpha value used to calculate the critical values of the t shown in
the output.

See “Output Options” at the beginning of this section for information on
the output storage options.
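
The difference between the equal-variance and unequal-variance forms of the unpaired t-test lies in the standard error and the degrees of freedom. The Python sketch below computes the test statistic and degrees of freedom for each, using the pooled variance in the first case and the Welch-Satterthwaite approximation in the second. The function names and data are hypothetical, and the use of the Welch-Satterthwaite formula is an assumption about what the unequal-variance command computes. Critical values need the t distribution and are not computed here.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t(x, y, hypothesized_diff=0.0):
    """t statistic and degrees of freedom assuming equal variances."""
    n1, n2 = len(x), len(y)
    # Pooled variance: the two sample variances weighted by their degrees of freedom.
    sp2 = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2)
    se = sqrt(sp2 * (1 / n1 + 1 / n2))
    return (mean(x) - mean(y) - hypothesized_diff) / se, n1 + n2 - 2


def welch_t(x, y, hypothesized_diff=0.0):
    """t statistic and approximate degrees of freedom for unequal variances."""
    n1, n2 = len(x), len(y)
    v1, v2 = variance(x) / n1, variance(y) / n2
    t = (mean(x) - mean(y) - hypothesized_diff) / sqrt(v1 + v2)
    # Welch-Satterthwaite approximation; usually not a whole number.
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# Hypothetical samples.
a = [1, 2, 3, 4]
b = [2, 3, 4, 5]
print(pooled_t(a, b))
print(welch_t(a, b))
```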


z-Test: Two Sample for Means 

The z-Test: Two Sample for Means command calculates the unpaired two 
sample z test. The test assumes that the variances in the two groups are 
known (though not necessarily equal to each other). The output includes 
both the one-tail and the two-tail critical values.

Variable 1 Range

Enter an input range of the first sample; it must be a single row or column.

Variable 2 Range

Enter an input range of the second sample; it must be a single row or
column.


Hypothesized Mean Difference

Enter a mean difference with which to calculate the z test. If no value is
entered, a mean difference of zero is assumed.


Variable 1 Variance (known)

Enter the known variance σ₁² of the first sample.

Variable 2 Variance (known)

Enter the known variance σ₂² of the second sample.

Labels

Indicate whether the first row of the Y range and that of the X range include
header information.

Alpha


Enter an alpha value used to calculate the critical values of the z shown in
the output.

See “Output Options” at the beginning of this section for information on
the output storage options.
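
As a rough illustration of what the z test computes, the following Python sketch derives the z statistic, p values, and critical values from two samples with known variances. The function name two_sample_z, the dictionary keys, and the data are hypothetical; the fragment assumes the usual textbook formula and is not a reproduction of the ToolPak's output.

```python
import math
from statistics import NormalDist, mean


def two_sample_z(sample1, sample2, var1, var2, hypothesized_diff=0.0, alpha=0.05):
    """Unpaired two-sample z test with known population variances."""
    se = math.sqrt(var1 / len(sample1) + var2 / len(sample2))
    z = (mean(sample1) - mean(sample2) - hypothesized_diff) / se
    std_normal = NormalDist()  # standard normal: mean 0, sd 1
    p_one = 1 - std_normal.cdf(abs(z))
    return {
        "z": z,
        "p_one_tail": p_one,
        "p_two_tail": 2 * p_one,
        "z_critical_one_tail": std_normal.inv_cdf(1 - alpha),
        "z_critical_two_tail": std_normal.inv_cdf(1 - alpha / 2),
    }


# Hypothetical data with both variances assumed known and equal to 4
result = two_sample_z([5, 6, 7, 8], [3, 4, 5, 6], var1=4.0, var2=4.0)
```

With alpha = 0.05 the one-tail and two-tail critical values are about 1.645 and 1.960, the familiar standard normal cutoffs.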
Excel’s Math and Statistical
Functions


This section documents all the functions provided with Excel that are 
relevant to statistics. So that you can more easily find the function you 
need, similar functions are grouped together in six categories: Descriptive 
Statistics for One Variable, Descriptive Statistics for Two or More 
Variables, Distributions, Mathematical Formulas, Statistical Analysis, and 
Trigonometric Formulas. 


Descriptive Statistics for One Variable 


Function Name 

Description 

AVEDEV 

AVEDEV(number1, number2, . . .) returns the
average of the (absolute) deviations of the points
from their mean.

AVERAGE

AVERAGE(number1, number2, . . .) returns the
average of the numbers (up to 30).

CONFIDENCE

CONFIDENCE(alpha, standarddev, n) returns a
confidence interval for the mean.

COUNT

COUNT(value1, value2, . . .) returns how many
numbers are in the value(s).

COUNTA

COUNTA(value1, value2, . . .) returns the count of
nonblank values in the list of arguments.

COUNTBLANK

COUNTBLANK(range) returns the count of blank
cells in the range.

COUNTIF

COUNTIF(range, criteria) returns the count of
nonblank cells in the range that meet the criteria.

DEVSQ

DEVSQ(number1, number2, . . .) returns the sum of
squared deviations from the mean of the numbers.

FREQUENCY

FREQUENCY(data-array, bins-array) returns the
frequency distribution of data-array as a vertical
array, on the basis of bins-array.

GEOMEAN

GEOMEAN(number1, number2, . . .) returns the
geometric mean of up to 30 numbers.

HARMEAN

HARMEAN(number1, number2, . . .) returns the
harmonic mean of up to 30 numbers.

KURT

KURT(number1, number2, . . .) returns the kurtosis
of up to 30 numbers.

LARGE

LARGE(array, n) returns the nth-largest value in
array.

MAX

MAX(number1, number2, . . .) returns the largest of
up to 30 numbers.

MEDIAN

MEDIAN(number1, number2, . . .) returns the
median of up to 30 numbers.

MIN

MIN(number1, number2, . . .) returns the smallest of
up to 30 numbers.

MODE

MODE(number1, number2, . . .) returns the value
most frequently occurring in up to 30 numbers or in
a specified array or reference.

PERCENTILE

PERCENTILE(array, n) returns the nth percentile of
the values in array.

PERCENTRANK

PERCENTRANK(array, value, significant-digits)
returns the percent rank of the value in the array,
with the specified number of significant digits
(optional).

PRODUCT

PRODUCT(number1, number2, . . .) returns the
product of up to 30 numbers.

RANK

RANK(number, range, order) returns the rank of the
number in the range. If order = 0, then the range is
ranked from largest to smallest; if order = 1, then
the range is ranked from smallest to largest.

QUOTIENT

QUOTIENT(dividend, divisor) returns the quotient
of the numbers, truncated to integers.

SKEW

SKEW(number1, number2, . . .) returns the
skewness of up to 30 numbers (or a reference to
numbers).

SMALL

SMALL(array, n) returns the nth-smallest number
in array.

STANDARDIZE

STANDARDIZE(x, mean, standard deviation)
normalizes a distribution and returns the z score
of x.

STDEV

STDEV(number1, number2, . . .) returns the sample
standard deviation of up to 30 numbers, or of an
array of numbers.

STDEVP

STDEVP(number1, number2, . . .) returns the
population standard deviation of up to 30 numbers
or of an array of numbers.

SUM

SUM(number1, number2, . . .) returns the sum of up
to 30 numbers or of an array of numbers.

SUMIF

SUMIF(range, criteria, sum-range) returns the sum
of the numbers in range (optionally in sum-range)
according to criteria.

SUMSQ

SUMSQ(number1, number2, . . .) returns the sum
of the squares of up to 30 numbers or of an array of
numbers.

TRIMMEAN

TRIMMEAN(array, percent) returns the mean of a
set of values in an array, excluding percent of the
values, half from the top and half from the bottom.

VAR

VAR(number1, number2, . . .) returns the sample
variance of up to 30 numbers (or an array or
reference).

VARP

VARP(number1, number2, . . .) returns the
population variance of up to 30 numbers (or an
array or reference).
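
To make a few of the less familiar entries concrete, here is a minimal Python sketch of what AVEDEV, DEVSQ, and TRIMMEAN compute. The lowercase function names and the data are hypothetical, and the trimming rule (dropping the same whole number of points from each end) is an assumption about how the percentage is rounded.

```python
from statistics import mean


def avedev(values):
    """Average of the absolute deviations from the mean (like AVEDEV)."""
    m = mean(values)
    return sum(abs(v - m) for v in values) / len(values)


def devsq(values):
    """Sum of squared deviations from the mean (like DEVSQ)."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def trimmean(values, percent):
    """Mean after excluding `percent` of the points, half from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * percent / 2)  # points dropped from each end (assumed rounding)
    return mean(ordered[k:len(ordered) - k])


data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample with mean 5
```

For this sample, avedev(data) is 1.5 and devsq(data) is 32; dividing DEVSQ by n − 1 or by n gives the quantities that VAR and VARP report.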


Descriptive Statistics for Two or More 
Variables 


Function Name 

Description 

CORREL 

CORREL(array1, array2) returns the coefficient of
correlation between array1 and array2.

COVAR

COVAR(array1, array2) returns the covariance of
array1 and array2.

PEARSON

PEARSON(array1, array2) returns the Pearson
correlation coefficient between array1 and array2.

RSQ

RSQ(known-y’s, known-x’s) returns the square of
Pearson’s product moment correlation coefficient.

SUMPRODUCT

SUMPRODUCT(array1, array2, . . .) returns the sum
of the products of corresponding entries in up to 30
arrays.

SUMX2MY2

SUMX2MY2(array1, array2) returns the sum of the
differences of squares of corresponding entries in
two arrays.

SUMX2PY2

SUMX2PY2(array1, array2) returns the sum of the
sums of squares of corresponding entries in two
arrays.

SUMXMY2

SUMXMY2(array1, array2) returns the sum of the
squares of differences of corresponding entries in
two arrays.
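
The four SUM... functions above are easy to confuse, so a short Python sketch may help. The lowercase function names and the data are hypothetical stand-ins chosen to mirror the Excel names.

```python
import math


def sumproduct(*arrays):
    """Sum of the products of corresponding entries."""
    return sum(math.prod(row) for row in zip(*arrays))


def sumx2my2(xs, ys):
    """Sum of the differences of squares: sum(x^2 - y^2)."""
    return sum(x * x - y * y for x, y in zip(xs, ys))


def sumx2py2(xs, ys):
    """Sum of the sums of squares: sum(x^2 + y^2)."""
    return sum(x * x + y * y for x, y in zip(xs, ys))


def sumxmy2(xs, ys):
    """Sum of the squares of differences: sum((x - y)^2)."""
    return sum((x - y) ** 2 for x, y in zip(xs, ys))


xs = [2, 3, 9]   # hypothetical data
ys = [6, 5, 11]
```

With these values the four functions return 126, -88, 276, and 24, respectively.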


Distributions 


Function Name 

Description 

BETADIST 

BETADIST(x, alpha, beta, a, b) returns the value of 
the cumulative beta probability density function. 

BETAINV 

BETAINV(p, alpha, beta, a, b) returns the value 
of the inverse of the cumulative beta probability 
density function. 

BINOMDIST

BINOMDIST(successes, trials, p, type) returns the
probability for the binomial distribution (type is
TRUE for the cumulative distribution function, FALSE
for the probability mass function).

CHIDIST

CHIDIST(x, df) returns the probability for the
chi-square distribution.

CHIINV

CHIINV(p, df) returns the inverse of the chi-square
distribution.

CRITBINOM

CRITBINOM(trials, p, alpha) returns the smallest
value so that the cumulative binomial distribution
is greater than or equal to the criterion value, alpha.

EXPONDIST

EXPONDIST(x, lambda, type) returns the probability
for the exponential distribution (type is true for
the cumulative distribution function, false for the
probability density function).

FDIST

FDIST(x, df1, df2) returns the probability for the
F distribution.

FINV

FINV(p, df1, df2) returns the inverse of the
F distribution.

GAMMADIST

GAMMADIST(x, alpha, beta, type) returns the
probability for the gamma distribution with
parameters alpha and beta (type is true for the
cumulative distribution function, false for the
probability density function).

GAMMAINV

GAMMAINV(p, alpha, beta) returns the inverse of
the gamma distribution.

GAMMALN

GAMMALN(x) returns the natural log of the gamma
function evaluated at x.

HYPGEOMDIST

HYPGEOMDIST(sample-successes, sample-size,
population-successes, population-size) returns the
probability for the hypergeometric distribution.

LOGINV

LOGINV(p, mean, sd) returns the inverse of the
lognormal distribution, where the natural logarithm
of the distribution is normally distributed with
mean mean and standard deviation sd.

LOGNORMDIST

LOGNORMDIST(x, mean, sd) returns the
probability for the lognormal distribution,
where the natural logarithm of the distribution
is normally distributed with mean mean and
standard deviation sd.

NEGBINOMDIST

NEGBINOMDIST(failures, threshold-successes,
probability) returns the probability for the negative
binomial distribution.

NORMDIST

NORMDIST(x, mean, sd, type) returns the
probability for the normal distribution with mean
mean and standard deviation sd (type is true for
the cumulative distribution function, false for the
probability density function).

NORMINV

NORMINV(p, mean, sd) returns the inverse of the
normal distribution with mean mean and standard
deviation sd.

NORMSDIST

NORMSDIST(number) returns the probability for
the standard normal distribution.

NORMSINV

NORMSINV(probability) returns the inverse of the
standard normal distribution.

POISSON

POISSON(x, mean, type) returns the probability for the
Poisson distribution (type is true for the cumulative
distribution, false for the probability mass function).

TDIST

TDIST(x, df, number-of-tails) returns the probability
for the t distribution.

TINV

TINV(p, df) returns the inverse of the t distribution.

WEIBULL

WEIBULL(x, alpha, beta, type) returns the probability
for the Weibull distribution (type is true for the
cumulative distribution function, false for the
probability density function).
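
The role of the type argument in the distribution functions can be illustrated with a small Python sketch for the two discrete cases. The function names binomdist and poisson are hypothetical stand-ins for the Excel functions, and the parameter name cumulative plays the part of type.

```python
import math


def binomdist(successes, trials, p, cumulative):
    """Binomial probability: cumulative (True) or probability mass (False)."""
    def pmf(k):
        return math.comb(trials, k) * p ** k * (1 - p) ** (trials - k)
    if cumulative:
        return sum(pmf(k) for k in range(successes + 1))
    return pmf(successes)


def poisson(x, mean, cumulative):
    """Poisson probability: cumulative (True) or probability mass (False)."""
    def pmf(k):
        return math.exp(-mean) * mean ** k / math.factorial(k)
    if cumulative:
        return sum(pmf(k) for k in range(x + 1))
    return pmf(x)


exactly_two = binomdist(2, 4, 0.5, False)   # P(X = 2) in 4 fair trials
at_most_two = binomdist(2, 4, 0.5, True)    # P(X <= 2) in 4 fair trials
```

Here exactly_two is 0.375 and at_most_two is 0.6875; the cumulative form simply adds the probability masses from 0 up to the requested value.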


Mathematical Formulas 


Function Name 

Description 

ABS 

ABS(number) returns the absolute value of
number to the point specified.

COMBIN

COMBIN(x, n) returns the number of
combinations of x objects taken n at a time.

EVEN

EVEN(number) returns number rounded up to
the nearest even integer.

EXP

EXP(number) returns the exponential function of
number with base e.

FACT

FACT(number) returns the factorial of number.

FACTDOUBLE

FACTDOUBLE(number) returns the double
factorial of number.

FLOOR

FLOOR(number, significance) returns number
rounded down to the nearest multiple of the
significance value.

GCD

GCD(number1, number2, . . .) returns the greatest
common divisor of up to 29 numbers.

GESTEP

GESTEP(number, step) returns 1 if number is
greater than or equal to step, 0 if not.

LCM

LCM(number1, number2, . . .) returns the least
common multiple of up to 29 numbers.

INT

INT(number) truncates number to the units place.

LN

LN(number) returns the natural logarithm of
number.






LOG

LOG(number, base) returns the logarithm of number,
with the specified (optional, default is 10) base.

LOG10

LOG10(number) returns the common logarithm
of number.

MOD

MOD(number, divisor) returns the remainder of
the division of number by divisor.

MULTINOMIAL

MULTINOMIAL(number1, number2, . . .)
returns the quotient of the factorial of the sum
of numbers and the product of the factorials of
numbers.

ODD

ODD(number) returns number rounded up to the
nearest odd integer.

POWER

POWER(number, power) returns number raised
to the power.

PERMUT

PERMUT(x, n) returns the number of
permutations of x items taken n at a time.

RAND

RAND() returns a randomly chosen number from
0 up to but not including 1.

ROUND

ROUND(number, places) rounds number to a
certain number of decimal places (if places is
positive), or to an integer (if places is 0), or to the
left of the decimal point (if places is negative).

ROUNDDOWN

ROUNDDOWN(number, places) rounds like
ROUND, except always toward 0.

ROUNDUP

ROUNDUP(number, places) rounds like ROUND,
except always away from 0.

SERIESSUM

SERIESSUM(x, n, m, coefficients) returns the
sum of the power series
$a_1 x^{n} + a_2 x^{n+m} + \cdots + a_i x^{n+(i-1)m}$, where
$a_1, a_2, \ldots, a_i$ are the coefficients.

SIGN

SIGN(number) returns 0, 1, or −1, the sign of
number.

SQRT

SQRT(number) returns the square root of
number.

SQRTPI

SQRTPI(number) returns the square root of
number × π.

TRIM

TRIM(text) returns text with spaces removed,
except for single spaces between words.

TRUNC

TRUNC(number, digits) truncates number to an
integer (optionally, to a number of digits).
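
Two of the entries above, SERIESSUM and MULTINOMIAL, are easier to follow with a worked calculation. The Python sketch below uses hypothetical lowercase function names and assumes the power series starts at exponent n and increases by m for each successive coefficient.

```python
import math


def seriessum(x, n, m, coefficients):
    """Sum of a_1*x^n + a_2*x^(n+m) + ... for the given coefficients."""
    return sum(a * x ** (n + i * m) for i, a in enumerate(coefficients))


def multinomial(*numbers):
    """Factorial of the sum divided by the product of the factorials."""
    denominator = math.prod(math.factorial(k) for k in numbers)
    return math.factorial(sum(numbers)) // denominator


series_value = seriessum(2, 1, 2, [1, 1, 1])  # 2^1 + 2^3 + 2^5
arrangements = multinomial(2, 3, 4)           # 9! / (2! 3! 4!)
```

The first call evaluates 2 + 8 + 32 = 42, and the second gives 1260.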

Statistical Analysis


Function Name

Description

CHITEST

CHITEST(observed, expected) returns the p value
of the Pearson chi-square test for observed and
expected counts.

GROWTH

GROWTH(known-y’s, known-x’s, new-x’s,
constant) returns the predicted (y) values for the
new-x’s, based on exponential regression of the
known-y’s on the known-x’s.

FISHER

FISHER(x) returns the value of the Fisher
transformation evaluated at x.

FISHERINV

FISHERINV(y) returns the value of the inverse
Fisher transformation evaluated at y.

FORECAST

FORECAST(x, known-y’s, known-x’s) returns
a predicted (y) value for x, based on linear
regression of the known-y’s on the known-x’s.

FTEST

FTEST(array1, array2) returns the two-tailed
p value of the F test, on the basis of the
hypothesis that the variances of array1 and array2
are not significantly different (which is rejected
for low p values).

INTERCEPT

INTERCEPT(known-y’s, known-x’s) returns the
y intercept of the linear regression of known-y’s
on known-x’s.

LINEST

LINEST(known-y’s, known-x’s, constant, stats)
returns coefficients in the linear regression of
known-y’s on known-x’s (constant is false if the
intercept is forced to be 0, and stats is true if
regression statistics are desired).

LOGEST

LOGEST(known-y’s, known-x’s, constant, stats)
returns the exponential regression of known-y’s
on known-x’s (constant is false if the leading
coefficient is forced to be 1, and stats is true if
regression statistics are desired).

PROB

PROB(x-values, probabilities, value) returns
the probability associated with value, given the
probabilities of a range of values.

PROB

PROB(x-values, probabilities, lower-limit,
upper-limit) returns the probability associated
with values between lower-limit and
upper-limit.

SLOPE

SLOPE(known-y’s, known-x’s) returns the slope
of a linear regression line.

STEYX

STEYX(known-y’s, known-x’s) returns the
standard error of the linear regression.

TREND

TREND(known-y’s, known-x’s, new-x’s, constant)
returns the y values of given input values
(new-x’s) based on the regression of known-y’s on
known-x’s. If constant = false, the constant value
is zero.

TTEST

TTEST(array1, array2, number-of-tails, type)
returns the p value of a t test, of type paired (1),
two-sample equal variance (2), or two-sample
unequal variance (3).

ZTEST

ZTEST(array, x, sigma) returns the p value of a
one-tailed z test, where x is the value to test and
sigma is the population standard deviation.
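
The relationship among SLOPE, INTERCEPT, and FORECAST can be sketched with ordinary least squares. The Python function names below are hypothetical stand-ins for the Excel functions, and the data are invented for illustration.

```python
from statistics import mean


def slope(known_ys, known_xs):
    """Least-squares slope of the regression of y on x."""
    mx, my = mean(known_xs), mean(known_ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(known_xs, known_ys))
    sxx = sum((x - mx) ** 2 for x in known_xs)
    return sxy / sxx


def intercept(known_ys, known_xs):
    """Least-squares y intercept: mean(y) - slope * mean(x)."""
    return mean(known_ys) - slope(known_ys, known_xs) * mean(known_xs)


def forecast(x, known_ys, known_xs):
    """Predicted y value at x from the fitted line."""
    return intercept(known_ys, known_xs) + slope(known_ys, known_xs) * x


xs = [1, 2, 3, 4]   # hypothetical predictor values
ys = [2, 4, 5, 8]   # hypothetical responses
```

For these data the fitted line has slope 1.9 and intercept 0, so forecast(5, ys, xs) is 9.5. Note that, as in Excel, the y values are given before the x values.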


Trigonometric Formulas 


Function Name 

Description 

ACOS 

ACOS(number) returns the arccosine (inverse
cosine) of number.

ACOSH

ACOSH(number) returns the inverse hyperbolic
cosine of number.

ASIN

ASIN(number) returns the arcsine (inverse sine)
of number.

ASINH

ASINH(number) returns the inverse hyperbolic
sine of number.

ATAN

ATAN(number) returns the arctangent (inverse
tangent) of number.

ATAN2

ATAN2(x, y) returns the arctangent (inverse
tangent) of the angle from the positive x axis.

ATANH

ATANH(number) returns the inverse hyperbolic
tangent of number.

COS

COS(angle) returns the cosine of angle.

COSH

COSH(number) returns the hyperbolic cosine of
number.

DEGREES

DEGREES(angle) returns the degree measure of
an angle given in radians.

PI

PI() returns π accurate to 15 digits.

RADIANS

RADIANS(angle) returns the radian measure of
an angle given in degrees.

SIN

SIN(angle) returns the sine of angle.

SINH

SINH(number) returns the hyperbolic sine of
number.

TAN

TAN(angle) returns the tangent of angle.

TANH

TANH(number) returns the hyperbolic tangent of
number.

StatPlus™ Commands


StatPlus™ is supplied with the textbook Data Analysis with Microsoft Excel 2007
to perform basic statistical analysis not covered by Excel or the Analysis
ToolPak. To use StatPlus, you must first verify that it is available to your workbook.

To check whether StatPlus is available: 

1 If the StatPlus menu appears in the Menu Commands group of the 
Add-Ins tab, StatPlus is loaded and activated on your system. 

2 If the menu command does not appear, click Tools>Add-Ins from 
the menu. If the StatPlus option is listed in the Add-Ins list box, 
click the checkbox. StatPlus is now available to you. 

3 If StatPlus is not listed in the Add-Ins list box, you will have to install 
it from your instructor’s disk. See Chapter 1 for more information. 


The rest of this section documents each StatPlus Add-In command, showing 
each corresponding dialog box, and describes the command’s options and output. 


Creating Data 

Bivariate Normal Data 

The StatPlus>Create Data>Bivariate Normal command creates two columns 
of random normal data where the standard deviation σ of the data in the first
column and the correlation between the columns are specified by the user. 
The standard deviation of the data in the second column of data is a function 
of the standard deviation of the data in the first column. 

Patterned Data


The StatPlus>Create Data>Patterned Data command generates a column of data
following a specified pattern. The pattern can be created on the basis of a
sequence of numbers or taken from a number sequence entered in a data column
already existing in the workbook. The user can specify how often each number
in the pattern is repeated and how many times the entire sequence is repeated.



Random Numbers 

The StatPlus>Create Data>Random Numbers command generates columns 
of random numbers for a specified probability distribution. The user specifies
the number of samples (columns) of random numbers and the sample
size (rows) of each sample.

Manipulating Columns

Indicator Columns 

The StatPlus>Manipulate Columns>Create Indicator Columns command 
takes a column of category levels and creates columns of indicator variables, 
one for each category level in the input range. An indicator variable for a 
particular category = 1 if the row comes from an observation belonging to 
that category and 0 otherwise. 
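
The idea of an indicator variable can be shown in a few lines of Python. This is only a sketch of the concept, not StatPlus's implementation; the function name indicator_columns and the category labels are hypothetical, and sorting the levels alphabetically is an assumption.

```python
def indicator_columns(categories):
    """Build one 0/1 column per category level found in the input."""
    levels = sorted(set(categories))  # assumed ordering of the output columns
    return {
        level: [1 if value == level else 0 for value in categories]
        for level in levels
    }


# Hypothetical column of category levels
columns = indicator_columns(["a", "b", "a", "c"])
```

Each row has a 1 in exactly one of the resulting columns, the one matching its category.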



Two-Way Table 

The StatPlus>Manipulate Columns>Create Two-Way Table command takes 
data arranged in three columns (a column of values, a column of category
levels for one factor, and a second column of category levels for a second
factor) and arranges the data into a two-way table. The columns of the table
consist of the different levels of the first factor; the rows of the table consist
of different levels of the second factor. Multiple values for each combination
of the two factors show up in different rows within the table. Output
from this command can be used in the Analysis ToolPak’s ANOVA commands.
The numbers of rows in the three columns must be equal. The user
can choose whether to sort the row and column headers of the table.

Unstack Column


The StatPlus>Manipulate Columns>Unstack Column command takes data 
found in two columns—a column of data values and a column of categories— 
and outputs the values into different columns, one column for each category 
level. The length of the values column and the length of the category column 
must be equal. The user can choose whether to sort the columns in ascending
order of the category variable. 
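
Unstacking can be pictured as grouping values by category. The Python sketch below illustrates the idea under simple assumptions; the function name unstack, its sort_columns flag, and the data are hypothetical and do not describe StatPlus internals.

```python
def unstack(values, categories, sort_columns=True):
    """Split one column of values into one list per category level."""
    if len(values) != len(categories):
        raise ValueError("values and categories must have the same length")
    columns = {}
    for value, category in zip(values, categories):
        columns.setdefault(category, []).append(value)
    if sort_columns:
        columns = dict(sorted(columns.items()))  # ascending category order
    return columns


# Hypothetical stacked data: measurements labeled by group
unstacked = unstack([10, 20, 30, 40], ["B", "A", "B", "A"])
```

The result places 20 and 40 under category A and 10 and 30 under category B, with the columns in ascending order of category.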



Stack Columns 

The StatPlus>Manipulate Columns>Stack Columns command takes data 
that lie in separate columns and stacks the values into two columns. The 
column to the left contains the values; the column to the right is a category 
column. Values for the category are found from the header rows in the input
columns, or, if there are no header rows, the categories are labeled as
Level 1, Level 2, and so forth. The input range need not be contiguous.

Standardize Data Values


The StatPlus>Manipulate Columns>Standardize command standardizes 
values in a collection of data columns. The user can choose from one of five 
different standardization methods. 



Sampling Data 

Conditional Sample 

The StatPlus>Sampling>Conditional Sample command extracts data values 
from a collection of columns corresponding to a specified condition. 




Periodic Sample 


The StatPlus>Sampling>Periodic Sample command samples data values 
from a collection of columns starting at a specified row and then extracting
every ith row, where i is specified by the user.
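
Periodic sampling amounts to taking every ith row from a starting position. A minimal Python sketch, with a hypothetical function name and a zero-based starting index assumed for convenience:

```python
def periodic_sample(rows, start, step):
    """Return rows[start], rows[start + step], rows[start + 2*step], ..."""
    if step < 1:
        raise ValueError("step must be a positive integer")
    return rows[start::step]


# Hypothetical data: ten rows numbered 0 through 9
sampled = periodic_sample(list(range(10)), start=2, step=3)
```

Starting at row 2 and taking every third row selects rows 2, 5, and 8.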



Random Sample 

The StatPlus>Sampling>Random Sample command extracts a random 
sample of a given size from a collection of columns. The user can choose 
whether to sample with replacement or without replacement. 
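
The difference between sampling with and without replacement can be sketched in Python as follows. The function name random_sample and its parameters are hypothetical, and the fragment illustrates the concept rather than the StatPlus command itself.

```python
import random


def random_sample(rows, size, with_replacement=False, seed=None):
    """Draw `size` rows at random, with or without replacement."""
    rng = random.Random(seed)  # a seed makes the draw repeatable
    if with_replacement:
        return rng.choices(rows, k=size)  # the same row may appear twice
    return rng.sample(rows, size)         # each row appears at most once


rows = list(range(1, 11))  # hypothetical data: rows 1 through 10
without = random_sample(rows, 5, seed=1)
with_repl = random_sample(rows, 15, with_replacement=True, seed=1)
```

Without replacement, the sample size cannot exceed the number of rows; with replacement it can, because rows may be drawn more than once.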



Single-Variable Charts 

Boxplots 


The StatPlus>Single Variable Charts>Boxplots command creates a boxplot.
The data values can be arranged either as separate columns or in one column
with a category variable. Users can choose to add a dotted line for the
sample average and to connect the medians between the boxes. The boxplot
can be sent to an embedded chart on a worksheet or to its own chart sheet.

Fast Scatterplot

The StatPlus>Single Variable Charts>Fast Scatterplot command creates a 
quick scatterplot bypassing many of the commands on the Excel ribbon. The 
scatterplot can be sent to an embedded chart on a worksheet or to its own 
chart sheet. 



Histogram 


The StatPlus>Single Variable Charts>Histograms command creates a histogram.
The user can specify a frequency, cumulative frequency, percentage, or
cumulative percentage chart. Also, the histogram can be broken down into
the different levels of a categorical variable. If a categorical variable is used,
the histogram bars can be (1) stacked, (2) displayed side by side, or (3) displayed
in 3-D. The user can choose to add a normal curve to the histogram,
as well as to display the corresponding frequency table. The histogram can
be sent to an embedded chart on a worksheet or to its own chart sheet.

Stem and Leaf Plots


The StatPlus>Single Variable Charts>Stem and Leaf Plots command creates
a stem and leaf plot. The data values can be arranged either as separate
columns or in one column with a category variable. If more than one stem
and leaf plot is generated, the user can choose to apply the same stem values
to each of the plots and to add a summary stem and leaf plot. The user can
also choose to truncate outliers of either moderate or major size. The stem
and leaf plot appears as values within a worksheet.

Normal Probability Plot


The StatPlus>Single Variable Charts>Normal P-plots command creates a 
normal probability plot with a table of normal scores for data in a single 
column. The normal probability plot can be sent to an embedded chart on a 
worksheet or to its own chart sheet. 








Multivariable Charts 

Fast Bubble Plot 

The StatPlus>Multivariable Charts>Fast Bubble Plot command creates a 
quick bubble plot for data arranged in different columns. The bubble plot 
can be sent to a chart sheet or embedded on a worksheet. 

Multiple Histograms


The StatPlus>Multivariable Charts>Multiple Histograms command creates
stacked histogram charts. The source data can be arranged in separate columns
or within a single column along with a column of category values. The
user can choose to display frequencies, cumulative frequencies, percentages,
or cumulative percentages. A normal curve can also be added to each of the
histograms. The histograms have common bin values and are shown in the
same vertical-axis scale. The histogram charts are sent to embedded charts
on a worksheet.



Multiple Normal Probability Plots 

The StatPlus>Multivariable Charts>Normal P-plots command creates a collection of
normal probability plots for variables arranged in columns. Normal curves
are plotted on a single chart.

Scatterplot Matrix

The StatPlus>Multivariable Charts>Scatterplot Matrix command creates 
a matrix of scatterplots. The scatterplots are sent to embedded charts on a 
worksheet. 








Quality-Control Charts 

C-Charts 


The StatPlus>QC Charts>C-Chart command creates a C-chart (count chart) 
of quality-control data for a single column of counts (for example, the number
of defects in an assembly line). The count chart includes a mean line
and lower and upper control limits. The C-chart can be sent to an embedded 
chart on a worksheet or to its own chart sheet. 
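
The mean line and control limits of a count chart can be sketched as follows. The Python fragment assumes the conventional three-sigma limits for counts, mean ± 3 × √mean with the lower limit floored at zero; the text does not state the exact formula StatPlus uses, and the function name and data are hypothetical.

```python
import math
from statistics import mean


def c_chart_limits(counts):
    """Return (lower control limit, center line, upper control limit)."""
    center = mean(counts)
    spread = 3 * math.sqrt(center)   # three-sigma band, assuming Poisson counts
    lower = max(0.0, center - spread)
    upper = center + spread
    return lower, center, upper


# Hypothetical defect counts from four inspection periods
lcl, center_line, ucl = c_chart_limits([14, 18, 16, 16])
```

With a mean of 16 defects, the limits fall at 4 and 28; a count outside that band would signal a process that may be out of control.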

Individuals Chart

The StatPlus>QC Charts>Individuals Chart command creates an Individuals
chart of quality-control data for a single column of quality-control
values, where there is no subgroup available. The Individuals chart can be
sent to an embedded chart on a worksheet or to its own chart sheet. 



P-Charts 


The StatPlus>QC Charts>P-Chart command creates a P-chart (proportion
chart) of quality-control data. Proportion values are placed in a single column.
The user can specify a single sample size for all proportion values or
can use a column of sample-size values. The P-chart can be sent to an
embedded chart on a worksheet or to its own chart sheet.
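
For the case of a single sample size shared by all proportions, the control limits of a proportion chart can be sketched in Python. The formula used, p̄ ± 3 × √(p̄(1 − p̄)/n) clipped to the interval from 0 to 1, is the conventional one and is assumed here; the function name and data are hypothetical.

```python
import math
from statistics import mean


def p_chart_limits(proportions, sample_size):
    """Return (lower control limit, center line, upper control limit)."""
    p_bar = mean(proportions)
    sigma = math.sqrt(p_bar * (1 - p_bar) / sample_size)
    lower = max(0.0, p_bar - 3 * sigma)
    upper = min(1.0, p_bar + 3 * sigma)
    return lower, p_bar, upper


# Hypothetical defect proportions, each based on 100 inspected items
p_lcl, p_center, p_ucl = p_chart_limits([0.1, 0.2, 0.3, 0.2], sample_size=100)
```

Here the center line is 0.20 and the limits are approximately 0.08 and 0.32. When sample sizes vary, the limits would be computed separately for each sample.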

Range Charts

The StatPlus>QC Charts>Range Chart command creates a Range chart of
quality-control data. The subgroups can be arranged in rows across separate
columns or within a single column of data values alongside a column of
subgroup levels. The user can use a known value of σ or create the Range
chart with an unknown σ value. The Range chart can be sent to an embedded
chart on a worksheet or to its own chart sheet.

Moving Range Charts


The StatPlus>QC Charts>Moving Range Chart command creates a Moving 
Range chart of quality-control data where there is no subgroup available. The 
quality-control values must be placed in a single column. The Moving Range 
chart can be sent to an embedded chart on a worksheet or to its own chart sheet. 



XBAR Charts 

The StatPlus>QC Charts>XBAR Chart command creates an XBAR chart of 
quality-control data. The subgroups can be arranged in rows across separate 
columns or within a single column of data values alongside a column of 
subgroup levels. The user can use known values of μ and σ or create the
XBAR chart with unknown μ and σ values. The XBAR chart can be sent to
an embedded chart on a worksheet or to its own chart sheet.

S-Charts


The StatPlus>QC Charts>S-Chart command creates an s-chart or sigma-chart
of quality-control data. The subgroups can be arranged in rows across separate
columns or within a single column of data values alongside a column
of subgroup levels. The s-chart can be sent to an embedded chart on a
worksheet or to its own chart sheet.



Generic QC Charts 

The StatPlus>QC Charts>Generic QC Chart command creates a generic
quality-control chart where the user specifies the location of the lower control
limit (LCL), center line, and upper control limit (UCL). The chart can be
sent to an embedded chart on a worksheet or to its own chart sheet.

Pareto Chart


The StatPlus>QC Charts>Pareto Chart command creates a Pareto chart of
quality-control data. Data values can be arranged in separate columns or
within a single column along with a column of category values. The user
specifies conditions for a defective value. The Pareto chart can be sent to an
embedded chart on a worksheet or to its own chart sheet.



Descriptive Statistics 

Frequency Table 

The StatPlus>Descriptive Statistics>Frequency Table command creates a
table containing frequency, cumulative frequency, percentage, and cumulative
percentage. The frequency table can be displayed either by discrete values
in the data column or by bin values. If bin values are used, the user can
specify how the data are counted relative to the placement of the bins. The
frequency table can also be broken down by the values of a By variable.

Table Statistics

The StatPlus>Descriptive Statistics>Table Statistics command creates a table
of descriptive statistics for a two-way cross-classification table. The first
column of the table contains the titles of the descriptive statistics, the second
column shows their values, the third column indicates the degrees of
freedom, and the fourth column shows the p value or asymptotic standard
error. The user must select the range containing the two-way table, excluding
the row and column totals but including the row and column headers.

Univariate Statistics


The StatPlus>Descriptive Statistics>Univariate Statistics command creates
a table of univariate statistics. The user can choose from a selection of
33 different statistics, either by selecting the statistics individually or by
selecting entire groups of statistics. Statistics can be displayed in different
columns or in different rows. The table can be broken down using a By
variable.



One Sample Tests 

One Sample t-test 

The StatPlus>One Sample Tests>1 Sample t-test command performs a
one-sample t-test and calculates a confidence interval. The data values can
be arranged either as a single column or as two columns (in which case 
the command will analyze the paired difference between the columns). If 
two columns are used, the columns must have the same number of rows. 
Users can specify the null and alternative hypotheses as well as the size of 
the confidence interval. The output can be broken down by the levels of a 
By variable. 

One Sample z test

The StatPlus>One Sample Tests>1 Sample z test command performs a one
sample z test and calculates a confidence interval for data with a known
standard deviation. The data values can be arranged either as a single column
or as two columns (in which case the command will analyze the paired
difference between the columns). If two columns are used, the columns
must have the same number of rows. Users can specify the null and alternative
hypotheses as well as the size of the confidence interval. The output
can be broken down by the levels of a By variable. Users must specify the
value of the standard deviation.

One Sample Sign test


The StatPlus>One Sample Tests>1 Sample Sign test command performs a
one-sample Sign test and calculates a confidence interval. The data values
can be arranged either as a single column or as two columns (in which case
the command will analyze the paired difference between the columns). If
two columns are used, the columns must have the same number of rows.
Users can specify the null and alternative hypotheses as well as the size
of the confidence interval. For confidence intervals, the user specifies that
the calculated interval be approximately, at least, or at most the size of the
specified interval. The output can be broken down by the levels of a By
variable.
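
The logic of the Sign test can be illustrated with a short Python sketch. It assumes the usual exact binomial calculation for a two-sided alternative, with values equal to the hypothesized median dropped; the function name and data are hypothetical, and StatPlus may organize the calculation differently.

```python
import math


def sign_test_p_value(values, hypothesized_median=0.0):
    """Two-sided exact Sign test p value based on the binomial(n, 1/2) distribution."""
    above = sum(1 for v in values if v > hypothesized_median)
    below = sum(1 for v in values if v < hypothesized_median)
    n = above + below              # ties with the hypothesized median are dropped
    k = min(above, below)
    # Probability of k or fewer signs of the rarer kind when each sign has chance 1/2
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical data: the values 1 through 10 tested against a median of 2.5
p_value = sign_test_p_value(list(range(1, 11)), hypothesized_median=2.5)
```

Two values fall below 2.5 and eight above, giving a two-sided p value of about 0.109. Because only the signs are used, the test makes no assumption about the shape of the distribution.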



One Sample Wilcoxon Signed Rank test 

The StatPlus>One Sample Tests>1 Sample Wilcoxon Signed Rank test command
performs a one-sample Wilcoxon Signed Rank test and calculates a
confidence interval. The data values can be arranged either as a single column
or as two columns (in which case the command will analyze the paired
difference between the columns). If two columns are used, the columns
must have the same number of rows. Users can specify the null and alternative
hypotheses as well as the size of the confidence interval. The output
can be broken down by the levels of a By variable.

Two Sample Tests

Two Sample t-test 

The StatPlus>Two Sample Tests>2 Sample Mest command performs a two 
sample f-test for data values, arranged either in two separate columns or 
within a single column alongside a column of category levels. Users can 
specify the null and alternative hypotheses as well as the size of the con¬ 
fidence interval. The test can use either a pooled or an unpooled variance 
estimate. The output can he broken down hy the levels of a By variable. 
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Two Sample z test 


The StatPlus>Two Sample Tests>2 Sample z-test command performs a two 
sample z test for data values, arranged either in two separate columns or 
within a single column alongside a column of category levels. Users can 
specify the null and alternative hypotheses as well as the size of the con¬ 
fidence interval. Users must enter the standard deviation for each sample. 
The output can he broken down hy the levels of a By variable. 
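
The pooled and unpooled variance options of the two-sample t-test differ only in the standard error and the degrees of freedom. The Python sketch below shows one common way to compute both. It assumes the unpooled case uses the Welch-Satterthwaite approximation; the source does not say which formula StatPlus applies, and the function name two_sample_t is hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def two_sample_t(x, y, pooled=True):
    """Return (t statistic, degrees of freedom) for H0: mean(x) == mean(y)."""
    n1, n2 = len(x), len(y)
    v1, v2 = variance(x), variance(y)  # sample variances (n - 1 divisor)
    diff = mean(x) - mean(y)
    if pooled:
        # One common variance estimated from both samples.
        sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
        se = sqrt(sp2 * (1 / n1 + 1 / n2))
        df = n1 + n2 - 2
    else:
        # Separate variances; Welch-Satterthwaite degrees of freedom (assumed).
        a, b = v1 / n1, v2 / n2
        se = sqrt(a + b)
        df = (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))
    return diff / se, df

t_pooled, df_pooled = two_sample_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
t_unpooled, df_unpooled = two_sample_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10], pooled=False)
```

With equal sample sizes the two t statistics coincide and only the degrees of freedom differ, so the choice matters most when the sample sizes and variances are both unequal.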



Two Sample Mann-Whitney test 

The StatPlus>Two Sample Tests>2 Sample Mann-Whitney Rank test 
command performs a two sample Mann-Whitney Rank test for data values, 
arranged either in two separate columns or within a single column 
alongside a column of category levels. Users can specify the null and alternative 
hypotheses as well as the size of the confidence interval. The output can 
be broken down by the levels of a By variable. 

Multivariate Analyses 

Correlation Matrix 

The StatPlus>Multivariate Analysis>Correlation Matrix command creates 
a correlation matrix for data arranged in different columns. The correlation 
matrix can use either the Pearson correlation coefficient or the nonparametric 
Spearman rank correlation coefficient. You can also output a matrix of 
p values for the correlation matrix. 



Means Matrix 

The StatPlus>Multivariate Analysis>Means Matrix command creates a 
matrix of pairwise mean differences for data. The data values can be arranged in 
separate columns or within a single column alongside a column of category 
levels. The output includes a matrix of p values with an option to adjust the 
p value for the number of comparisons using the Bonferroni correction factor. 
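
The Bonferroni correction multiplies each p value by the number of comparisons and caps the result at 1. A minimal Python sketch follows; the group labels and raw p values are made up for illustration, and the function name bonferroni_adjust is hypothetical.

```python
from itertools import combinations

def bonferroni_adjust(p_values):
    """Multiply each p value by the number of comparisons, capping at 1."""
    m = len(p_values)
    return {pair: min(1.0, p * m) for pair, p in p_values.items()}

groups = ["A", "B", "C", "D"]
pairs = list(combinations(groups, 2))  # 6 pairwise comparisons
# Hypothetical unadjusted p values, one per pair.
raw = dict(zip(pairs, [0.004, 0.030, 0.200, 0.012, 0.450, 0.001]))
adjusted = bonferroni_adjust(raw)
```

With four groups there are six pairs, so an unadjusted p value must be below about 0.0083 to remain below 0.05 after adjustment.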



Time Series Analyses 

ACF Plot 


The StatPlus>Time Series>ACF Plot command creates a table of the 
autocorrelation function and a chart of the autocorrelation function, for time series data 
arranged in a single column. The first column in the output table contains the lag 
values up to a number specified by the user, the second column contains the 
autocorrelation, the third column of the table contains the lower 95% confidence 
boundary, and the fourth column contains the upper 95% confidence boundary. 
Autocorrelation values that lie outside the 95% confidence interval are shown 
in red. The chart shows the autocorrelations and the 95% confidence width. 
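
The four-column table described above can be sketched in Python. The source does not state how StatPlus computes its 95% boundaries, so the sketch assumes the simple large-sample band of plus or minus 1.96 divided by the square root of n; the function names acf and acf_table are hypothetical.

```python
from math import sqrt

def acf(series, lag):
    """Sample autocorrelation of series at the given lag."""
    n = len(series)
    m = sum(series) / n
    denom = sum((y - m) ** 2 for y in series)
    num = sum((series[t] - m) * (series[t - lag] - m) for t in range(lag, n))
    return num / denom

def acf_table(series, max_lag):
    """Rows of (lag, autocorrelation, lower bound, upper bound)."""
    bound = 1.96 / sqrt(len(series))  # assumed band, not confirmed by the source
    return [(k, acf(series, k), -bound, bound) for k in range(1, max_lag + 1)]

table = acf_table([1, 2, 3, 4, 5], max_lag=2)
```

An autocorrelation whose absolute value exceeds the bound corresponds to the values the command shows in red.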
Exponential Smoothing 

The StatPlus>Time Series>Exponential Smoothing command calculates one-, 
two-, or three-parameter exponential smoothing models for a single column 
of time series data. You can forecast future values of the time series on the 
basis of the smoothing model for a specified number of units and include a 
confidence interval of size specified by the user. The output includes a 
table of observed and forecasted values, future forecasted values, and a table 
of descriptive statistics including the mean square error and final values of 
the smoothing factors. A plot of the seasonal indexes (for three-parameter 
exponential smoothing) is included. The exponential smoothing output is not 
dynamic and will not update if the source data in the input range change. 
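
For the one-parameter model, the smoothed value is a weighted average of the newest observation and the previous smoothed value. The Python sketch below shows the recursion and the mean square error of the one-step-ahead forecasts. Starting the recursion at the first observation and dividing the squared errors by the full series length are assumptions; StatPlus may initialize and average differently.

```python
def exponential_smoothing(series, alpha, initial=None):
    """One-parameter smoothing: S[t] = alpha * y[t] + (1 - alpha) * S[t-1].

    forecasts[t] is the one-step-ahead forecast of series[t], that is,
    the smoothed value available before series[t] is observed.
    Returns (forecasts, final smoothed value, mean square error).
    """
    level = series[0] if initial is None else initial
    forecasts = []
    for y in series:
        forecasts.append(level)
        level = alpha * y + (1 - alpha) * level
    errors = [y - f for y, f in zip(series, forecasts)]
    mse = sum(e * e for e in errors) / len(errors)
    return forecasts, level, mse

forecasts, next_forecast, mse = exponential_smoothing([10, 12, 14], alpha=0.5)
```

The final smoothed value serves as the forecast for every future period, since the one-parameter model has no trend or seasonal term.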



Runs Test 


The StatPlus>Time Series>Runs Test command performs a Runs test on 
time series data. The test displays the number of runs, the expected 
number of runs, and the statistical significance. The cut point can either be the 
sample mean or be specified by the user. 

Seasonal Adjustment 

The StatPlus>Time Series>Seasonal Adjustment command creates a 
column of seasonally adjusted values for time series data that show periodicity, and 
creates a plot of unadjusted and adjusted values. A plot of the seasonal 
indexes is included in the output (multiplicative seasonality is assumed). 
The seasonal adjustment output is not dynamic and will not update if the 
source data in the input range change. 



StatPlus Options 

The StatPlus>StatPlus Options command allows the user to specify the 
default input and output options for the different StatPlus modules. One can 
also specify how StatPlus should handle hidden data. 

General Utilities 

Resolve StatPlus Links 

The StatPlus>General Utilities>Resolve StatPlus Links command redirects 
all StatPlus links in the workbook to the current location of the StatPlus 
add-in file. 


Freeze Data in Worksheet 

The StatPlus>General Utilities>Freeze Data in Worksheet command 
removes all formulas from the active worksheet, replacing them with values. 


Freeze Hidden Data 

The StatPlus>General Utilities>Freeze Hidden Data command removes all 
formulas from the StatPlus hidden worksheet, replacing them with values. 


Freeze Data in Workbook 

The StatPlus>General Utilities>Freeze Data in Workbook command 
removes all formulas from the active workbook, replacing them with values. 


View Hidden Data 

The StatPlus>General Utilities>View Hidden Data command unhides all of 
the hidden worksheets created by StatPlus. 

Rehide Hidden Data 

The StatPlus>General Utilities>Rehide Hidden Data command hides all of 
the hidden worksheets created by StatPlus. 

Remove Unlinked Hidden Data 

The StatPlus>General Utilities>Remove Unlinked Hidden Data command 
removes any hidden data created by StatPlus that are no longer linked to a 
worksheet in the active workbook. 


Unload Modules 

The StatPlus>Unload Modules command unloads StatPlus modules. Select 
the individual modules to unload from the list of loaded modules. 



About StatPlus 

The StatPlus>About StatPlus command provides information about the 
installation and version number of the StatPlus add-in. 

Chart Commands 


Label Chart Points 

The StatPlus>Label Series points command can be run when a chart is the 
active object in the workbook. You can link the labels to a cell range in 
the workbook and copy the cell format. You can also replace the points 
in the scatterplot with the labels. 



Display Chart Series by Category 

The StatPlus>Display series by category command can be run when a chart 
is the active object in the workbook. The command divides the chart series 
into several different series on the basis of the levels of the category 
variable. Note that you cannot undo this command. Once the chart series is 
broken down, it cannot be joined again. 



Select Row from Chart Series 

The StatPlus>Select Row command can be run when a chart is the active 
object in the workbook. The command selects the row in the worksheet 
corresponding to the point you selected. 

StatPlus™ Math and Statistical Functions 

The following functions are available in Excel when StatPlus™ is loaded: 


Descriptive Statistics for One Variable 


Function Name 

Description 

COUNTBETW 

COUNTBETW(range, lower, upper, boundary) 
returns the count of nonblank cells in the range 
that lie between the lower and upper values. The 
boundary variable determines how the end points 
are used. 

If boundary = 1, the interval is > the lower value 
and < the upper value. 

If boundary = 2, the interval is ≥ the lower value 
and < the upper value. 

If boundary = 3, the interval is > the lower value 
and ≤ the upper value. 

If boundary = 4, the interval is ≥ the lower value 
and ≤ the upper value. 

IQR 

IQR(range) calculates the interquartile range for 
the data in range. 

MODEVALUE 

MODEVALUE(range) calculates the mode of the data 
in range. The data are assumed to be in one column. 

NSCORE 

NSCORE(number, range) returns the normal score 
of number (or cell reference to number) from a 
range of values. 

RANGEVALUE 

RANGEVALUE(range) calculates the difference 
between the maximum and minimum values from 
a range of values. 

RANKTIED 

RANKTIED(number, range, order) returns the rank 
of the number in range, adjusting the rank for ties. 
If order = 0, then the range is ranked from largest 
to smallest; if order = 1, the range is ranked from 
smallest to largest. 

RUNS 

RUNS(range, [center]) returns the number of runs 
in the data column range. The center = 0 unless a 
center value is entered. 

SE 

SE(range) calculates the standard error of the 
values in range. 

SIGNRANK 

SIGNRANK(number, range) returns the sign rank 
of the number in range, adjusting the rank for ties. 
Values of zero receive a sign rank of 0. If order = 0, 
then the range is ranked from largest to smallest in 
absolute value; if order = 1, the range is ranked from 
smallest to largest in absolute value. 
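
Three of the functions above are simple enough to sketch in Python. These are not the StatPlus implementations: the sketch assumes that RANKTIED gives tied values their average rank, that RUNS skips values equal to the center, and that SE is the standard error of the mean.

```python
from math import sqrt
from statistics import stdev

def ranktied(number, values, order=1):
    """Rank of number within values; tied values share the average rank.

    order=1 ranks smallest to largest, order=0 largest to smallest.
    """
    if order == 1:
        ahead = sum(v < number for v in values)
    else:
        ahead = sum(v > number for v in values)
    ties = sum(v == number for v in values)
    return ahead + (ties + 1) / 2

def runs(values, center=0.0):
    """Number of runs of consecutive values on the same side of center.

    Assumption: values exactly equal to center are skipped.
    """
    signs = [v > center for v in values if v != center]
    return sum(1 for i, s in enumerate(signs) if i == 0 or s != signs[i - 1])

def se(values):
    """Standard error of the mean: sample standard deviation / sqrt(n)."""
    return stdev(values) / sqrt(len(values))

rank_of_20 = ranktied(20, [10, 20, 20, 30])   # the two 20s share ranks 2 and 3
run_count = runs([1, 2, -1, -2, 3, -1])       # runs: ++, --, +, -
std_error = se([2, 4, 4, 4, 5, 5, 7, 9])
```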
Descriptive Statistics for Two or More Variables 


Function Name 

Description 

CORRELP 

CORRELP(range1, range2) returns the p value 
for the Pearson coefficient of correlation between 
range1 and range2. (Note: Range values must be 
in two columns.) 

MEDIANDIFF 

MEDIANDIFF(range, range2) calculates the 
pairwise median difference between values in 
two separate columns. 

MWMedian 

MWMedian(range, range2) calculates the median of 
the Walsh averages between two columns of data. 

MWMedian2 

MWMedian2(range, range2) calculates the 
median of the Walsh averages for data values 
in one column (range) with category levels in a 
second column (range2). There can be only two 
levels in the categories column. 

PEARSONCHISQ 

PEARSONCHISQ(range) returns the Pearson 
chi-square test statistic for data in range. 

PEARSONP 

PEARSONP(range) returns the p value for the 
Pearson chi-square test statistic for data in range. 

SPEARMAN 

SPEARMAN(range) returns the Spearman 
nonparametric rank correlation for values in range. 
(Note: Range values must be in one column only.) 

SPEARMANP 

SPEARMANP(range) returns the p value for the 
Spearman nonparametric rank correlation for 
values in range. (Note: Range values must be in 
one column only.) 
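
The Pearson chi-square statistic compares each observed count in a two-way table with the count expected under independence. The Python sketch below assumes the input is a table of observed counts given as a list of rows, which the source does not spell out for PEARSONCHISQ; the function name and the counts are hypothetical.

```python
def pearson_chisq(table):
    """Pearson chi-square statistic and degrees of freedom for a two-way
    table of observed counts (a list of equal-length rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

# Hypothetical 2 x 2 table of counts.
stat, df = pearson_chisq([[10, 20], [20, 10]])
```

The p value is the upper-tail area of a chi-square distribution with df degrees of freedom evaluated at the statistic.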


Distributions 


Function Name 

Description 

NORMBETW 

NORMBETW(lower, upper, mean, stdev) 
calculates the area under the curve between the 
lower and upper limits for a normal distribution 
with μ = mean and σ = stdev. 

TDF 

TDF(number, df, cumulative) calculates the 
area under the curve to the left of number for 
a t distribution with degrees of freedom df, if 
cumulative = true. If cumulative = false, this 
function calculates the probability density 
function for number. 
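
The area that NORMBETW returns is the difference of two cumulative probabilities. A minimal Python sketch using the standard library follows; the lowercase function name normbetw is an illustrative stand-in, not the StatPlus function itself.

```python
from statistics import NormalDist

def normbetw(lower, upper, mean=0.0, stdev=1.0):
    """Area under the normal curve between lower and upper."""
    dist = NormalDist(mu=mean, sigma=stdev)
    return dist.cdf(upper) - dist.cdf(lower)

central_95 = normbetw(-1.96, 1.96)           # standard normal
within_one_sd = normbetw(90, 110, 100, 10)   # mean 100, stdev 10
```

In plain Excel the same quantity is the difference of two cumulative normal function calls, one at the upper limit and one at the lower limit.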
Mathematical Formulas 


Function Name 

Description 

IF2FUNC 

IF2FUNC(Fname, IFRange1, IFValue1, IFRange2, 
IFValue2, RangeAnd, [Arg1, Arg2, ...]) calculates 
the value of the Excel function Fname, for rows 
in a data set where the values of IFRange1 are 
equal to IFValue1 and the values of IFRange2 
are equal to IFValue2. Parameters of the 
Fname function can be inserted as Arg1, Arg2, 
and so forth. If RangeAnd = true, an AND 
clause is assumed between the two values. If 
RangeAnd = false, an OR clause is assumed. 

IFFUNC 

IFFUNC(Fname, IFRange, IFValue, [Arg1, Arg2, ...]) 
calculates the value of the Excel function Fname, 
for rows in a data set where the values of IFRange 
are equal to IFValue. Parameters of the Fname 
function can be inserted as Arg1, Arg2, and so 
forth. 

RANDBERNOULLI 

RANDBERNOULLI(prob) returns a random 
number from the Bernoulli distribution with 
probability = prob. 

RANDBETA 

RANDBETA(alpha, beta, [a], [b]) returns a 
random number from the Beta distribution 
with parameters alpha, beta, and (optionally) 
a and b where a and b are the end points of the 
distribution. 

RANDBINOMIAL 

RANDBINOMIAL(prob, trials) returns a random 
number from the binomial distribution with 
probability = prob and number of trials = trials. 

RANDCHISQ 

RANDCHISQ(df) returns a random number 
from the chi-square distribution with degrees of 
freedom df. 

RANDDISCRETE 

RANDDISCRETE(range, prob) returns a random 
number from a discrete distribution where the 
values of the distribution are found in the cell 
range range, and the associated probabilities are 
found in the cell range prob. 

RANDEXP 

RANDEXP(lambda) returns a random number 
from the exponential distribution where 
λ = lambda. 

RANDF 

RANDF(df1, df2) returns a random number from 
the F distribution with numerator degrees of 
freedom df1 and denominator degrees of freedom 
df2. 

RANDGAMMA 

RANDGAMMA(alpha, beta) returns a random 
number from the gamma distribution with 
parameters alpha and beta. 

RANDINTEGER 

RANDINTEGER(lower, upper) returns a random 
integer from a discrete uniform distribution 
with the lower boundary = lower and the upper 
boundary = upper. 

RANDLOG 

RANDLOG(mean, stdev) returns a random 
number from the log normal distribution with 
μ = mean and σ = stdev. 

RANDNORM 

RANDNORM(mean, stdev) returns a random 
number from the normal distribution with 
μ = mean and σ = stdev. 

RANDPOISSON 

RANDPOISSON(lambda) returns a random 
number from the Poisson distribution where 
λ = lambda. 

RANDT 

RANDT(df) returns a random number from the 
t distribution with degrees of freedom df. 

RANDUNI 

RANDUNI(lower, upper) returns a random number 
from the uniform distribution where the lower 
boundary = lower and the upper boundary = 
upper. 
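
For readers who want to reproduce this kind of random data outside Excel, a few of the generators above have close counterparts in Python's standard random module. The sketch below is an analogy under assumptions, not a description of how StatPlus generates its numbers; the lowercase function names and the fixed seed are illustrative choices.

```python
import random

rng = random.Random(2007)  # fixed seed so the sketch is reproducible

def randbernoulli(prob):
    """1 with probability prob, otherwise 0."""
    return 1 if rng.random() < prob else 0

def randdiscrete(values, probs):
    """One value drawn with the associated probabilities."""
    return rng.choices(values, weights=probs, k=1)[0]

def randinteger(lower, upper):
    """Integer from a discrete uniform distribution, end points included."""
    return rng.randint(lower, upper)

def randnorm(mean, stdev):
    """Draw from a normal distribution with the given mean and stdev."""
    return rng.gauss(mean, stdev)

coin = randbernoulli(0.5)
die = randinteger(1, 6)
grade = randdiscrete(["A", "B", "C"], [0.2, 0.5, 0.3])
height = randnorm(170.0, 10.0)
```

Unlike worksheet formulas, which recalculate and redraw, a seeded generator returns the same sequence on every run.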


Statistical Analysis 


Function Name 

Description 

ACF 

ACF(range, lag) calculates the autocorrelation 
function for values in range for lag = lag. Note: 
Range values must lie within one column. 

Bartlett 

Bartlett(range, range2, ...) calculates the p value 
for the Bartlett test assuming that the data are 
arranged in multiple columns. 

Bartlett2 

Bartlett2(range, range2) calculates the p value 
for the Bartlett test assuming one column of data 
values and one column of category values. 

DW 

DW(range) calculates the Durbin-Watson 
statistic for data in a single column. 

FTest2 

FTest2(range, range2) calculates the p value for 
the F test assuming one column of data values 
and one column of category values. 

Levene 

Levene(range, range2, ...) calculates the p value 
for the Levene test assuming that the data are 
arranged in multiple columns. 

Levene2 

Levene2(range, range2) calculates the p value 
for the Levene test assuming one column of data 
values and one column of category values. 
MannW 

MannW(range, range2, [median]) calculates the 
Mann-Whitney test statistic for data values in 
two columns. The median difference is assumed 
to be 0, unless a median value is specified. 

MannWp 

MannWp(range, range2, [median], [Alt]) 
calculates the p value of the Mann-Whitney test 
statistic for data values in two columns. The 
median difference is assumed to be 0, unless a 
median value is specified. The p value is for a 
two-sided alternative hypothesis unless Alt = 1, 
in which case a one-sided test is performed. 

MannW2 

MannW2(range, range2, [median]) calculates 
the Mann-Whitney test statistic for data values 
in one column (range) and category values in a 
second column (range2). There can be only two 
levels in the categories column. The median 
difference is assumed to be 0, unless a median 
value is specified. 

MannWp2 

MannWp2(range, range2, [median]) calculates 
the p value of the Mann-Whitney test statistic for 
data values in one column (range) and category 
values in a second column (range2). There can 
be only two levels in the categories column. The 
median difference is assumed to be 0, unless a 
median value is specified. The p value is for a 
two-sided alternative hypothesis unless Alt = 1, 
in which case a one-sided test is performed. 

Oneway 

Oneway(range, range2) calculates the p value of 
the one-way ANOVA for data arranged in two 
columns. 

RUNSP 

RUNSP(range, [center]) calculates the p value of 
the Runs test for values in the data column range. 
Center = 0 unless a center value is entered. 

TSTAT 

TSTAT(range, [mean]) calculates the one-sample 
t-test statistic for values in the data column range. 
The mean value under the null hypothesis is 
assumed to be 0, unless a mean value is specified. 

TSTATP 

TSTATP(range, [mean], [Alt]) calculates the 
p value for the one-sample t test statistic for values 
in the data column range. The mean value under 
the null hypothesis is assumed to be 0, unless a 
mean value is specified. A two-sided alternative 
hypothesis is assumed unless Alt = -1, in which 
case the “less than” alternative hypothesis is 
assumed, or Alt = 1, in which case the “greater 
than” alternative hypothesis is assumed. 
WILCOXON 

WILCOXON(range, [median]) calculates the 
Wilcoxon Signed Rank statistic for values in the 
data column range. The median value under 
the null hypothesis is assumed to be 0, unless a 
median value is specified. 

WILCOXONP 

WILCOXONP(range, [median], [Alt]) calculates 
the p value of the Wilcoxon Signed Rank statistic 
for values in the data column range. The median 
value under the null hypothesis is assumed 
to be 0, unless a median value is specified. 
A two-sided alternative hypothesis is assumed 
unless Alt = -1, in which case the “less than” 
alternative hypothesis is assumed, or Alt = 1, 
in which case the “greater than” alternative 
hypothesis is assumed. 

ZSTAT 

ZSTAT(range, sigma, [mean]) calculates the 
z-test statistic for values in the data column range 
with a standard deviation sigma. The mean value 
under the null hypothesis is assumed to be 0, 
unless a mean value is specified. 

ZSTATP 

ZSTATP(range, sigma, [mean], [Alt]) calculates 
the p value for the z test statistic for values in 
the data column range with a standard deviation 
sigma. The mean value under the null hypothesis 
is assumed to be 0, unless a mean value is 
specified. A two-sided alternative hypothesis 
is assumed unless Alt = -1, in which case the 
“less than” alternative hypothesis is assumed, 
or Alt = 1, in which case the “greater than” 
alternative hypothesis is assumed. 
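
The Alt convention shared by TSTATP, WILCOXONP, and ZSTATP can be illustrated with the z test, whose p value needs only the standard normal distribution. The Python sketch below mirrors the descriptions of ZSTAT and ZSTATP; the lowercase function names, the use of 0 for the two-sided default, and the sample values are assumptions for illustration.

```python
from math import sqrt
from statistics import NormalDist, mean

def zstat(values, sigma, mean0=0.0):
    """z statistic for H0: population mean == mean0, with known sigma."""
    return (mean(values) - mean0) / (sigma / sqrt(len(values)))

def zstatp(values, sigma, mean0=0.0, alt=0):
    """p value of the z test.

    alt = -1 gives the "less than" alternative, alt = 1 the "greater
    than" alternative, and any other value the two-sided alternative.
    """
    cdf = NormalDist().cdf(zstat(values, sigma, mean0))
    if alt == -1:
        return cdf
    if alt == 1:
        return 1 - cdf
    return 2 * min(cdf, 1 - cdf)

# Hypothetical sample with known sigma = 10, testing mean0 = 100.
sample = [102, 98, 105, 110, 95]
z = zstat(sample, sigma=10, mean0=100)
p_two_sided = zstatp(sample, sigma=10, mean0=100)
p_greater = zstatp(sample, sigma=10, mean0=100, alt=1)
```

When the sample mean lies above the hypothesized mean, the two-sided p value is twice the "greater than" p value, and the two one-sided p values always sum to 1.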
Index 


A 

About StatPlus command, 33 
Absolute reference, 50 
Acceptance region, 234-235 
ACF. See Autocorrelation function 
Active cell, 6 

Add-ins, 24-28. See also Analysis 
ToolPak add-ins 
tab, 9 

unloading, 30 
Additive seasonality, 464 
Advanced Filter, 56, 59-61 
Alternative hypothesis (Ha), 233 
Analysis of variance (ANOVA) 
Bonferroni correction factor, 
403-404 
cells in, 411 

comparing means, 402-404 
computing, 397-399 
effects model, 406, 408-409, 411 
examples of, 393-401, 413 
Excel used to perform two-way, 
419-422 

graphing data to verify, 395-397, 
414-417 

indicator variables, 406-407 
interaction plot, 417-419 
interpreting, 326-327, 399-401, 
422-424 

means model, 393, 406, 409 
one-way, 393, 406-409 
overparametrized model, 406 
regression analysis and, 325, 
326-327, 406-409 
replicates, 410 
single-factor, 523 
two-factor with replication, 524 
two-factor without replication, 525 
two-way, 410-413 
Analysis ToolPak/Data Analysis 
ToolPak, 24 

ANOVA two-way and, 419-422 
checking availability of, 522 
correlation matrix and, 343 
descriptive statistics and, 154 
effects model and, 406 


exponential smoothing, one- 
parameter command, 457 
frequency tables and, 134 
histograms and, 138 
loading, 28-29 

moving average command, 448 
percentiles and, 154 
random normal data and, 199 
regression analysis and, 323-325, 
357 

t test and, 248 
unloading, 30 
Analysis ToolPak add-ins 
ANOVA, single-factor command, 
523 

ANOVA, two-factor with 

replication command, 524 
ANOVA, two-factor without 
replication command, 525 
correlation command, 526 
covariance command, 527 
descriptive statistics command, 
527-529 

exponential smoothing command, 
457, 529-530 
F test command, 530-531 
histogram command, 531-532 
moving average command, 
532-533 

output options, 522-523 
random number generation 
command, 533-534 
rank and percentile command, 535 
regression command, 535-537 
sampling command, 537-538 
t test command, 539-542 
z test command, 542-543 
ANOVA. See Analysis of variance 
Area chart, 83 
Arguments, 47 
Attribute charts, 493 
Autocorrelation function (ACF) 
applications of, 441-443, 444-445 
computing, 441-445 


constant variance and, 441 
formulas, 440-441 
patterns of, 443-444 
plot command, 575 
random walk model, 444 
role of, 440-441 
seasonality and use of, 470-471 
Autofill, entering data with, 37-39 
AutoFilter, 56-59 
Average. See Mean 
Axes, editing, 97-100 
Axis titles 

working with, 94-97 

B 

Balanced data, 419 
Bar charts, 84. See also Histograms 
displaying categorical data in, 
283-285 

Bartlett’s test, 258 
Bernoulli distribution, 218 
Between-groups sum of squares, 
400 

Big Ten workbook example, 86 
Bin(s) 

counting with, 134-135 
in frequency tables, 134-136 
values, defining, 136-138 
Binomial distribution, 218 
Bivariate density function, 192 
Bivariate Normal command, 552 
Bonferroni 

correction factor, 403-404 
p values with, 342-343 
Boxplots 

command, 557-558 

comparing means with, 405-406 

creating, 171-174 

defined, 166 

fences, 167-168 

interquartile, 167 

outliers, 168 

seasonality and use of, 467-468 
whiskers, 169 
working with, 165-174 
Bubble chart, 84 
Bubble plots, creating, 110-117 

C 

Calculated criteria, 56 
Calculated values, using, 62-63 
Categorical variables. See 
Qualitative or categorical 
variables 
Categories 

breaking histogram into, 143-146 
breaking scatter plot into, 
117-120 

grouping, 300-302 
removing from PivotTable, 
280-282 

Causality, correlation and, 336-337 
C charts, 504-506 
command, 562 
Cells, 6, 14 
active, 15 
ANOVA and, 411 
cut and paste, 17 
moving, 16-17 
range, 15, 51-53 
references, 14, 16, 50 
selecting, 14-16 
Center line, 490 
Center, measures of, 154-158 
Centering, 41-42 
Central Limit Theorem, 212-217 
Charts/Chart Wizard. See also Boxplots; Control charts; Pivot 
tables; specific chart types 
and axis titles, 94-97 
commands, 557-567, 580 
creating bubble plots, 110-117, 
560 

data points, identifying, 105-110 
editing charts, 91-105 
editing plot symbols, 102-105 
enlarging, 92-93 
gridlines and legends, 100-102 
introduction, 82-85 
moving to a chart sheet, 93-94 
Pareto, 513-516, 567 
resizing and moving, 91-93 
scatter plots, 86-91, 117-120 
sheets, 10, 33, 84-85 
types of, 83-84 
variables, plotting, 120-123 
XBAR, 565 
Chart sheet, 93 
Chi-square statistic, Pearson 
breaking down, 297 
defined, 293 

validity with small frequencies, 
299-302 


working with distribution, 
293-295 
Coefficient(s) 
defined, 314 
of determination, 322 
multiple regression and, 359-360 
Pearson Correlation, 336 
prediction equation and, 361-362 
Spearman’s rank correlation, 337 
t tests for, 362-363 
Collinearity, 375 
Column(s) 
chart, 83 

Create command, 554 
headings, 6 

manipulating commands, 554-556 

Stacking command, 555 
Two-way table command, 554 
Unstacking command, 555 
Commands, running, 7-9 
Common causes, 489 
Common fields, 68 
Comparison criteria, 56 
Conclusions, drawing, 385 
Conditional Sample command, 556 
Cone chart, 84 
Confidence intervals 
calculating, 228-229 
defined, 225 
interpreting, 229-232 
Sign test and, 253-255 
z test statistic and z values, 
225-228 

Constant estimators, 205 
Constant term, 314 
Constant variance 
autocorrelation function and, 441 
in residuals, testing for, 332 
Context-sensitive ribbons, 9 
Contingency measure, 298 
Continuity adjusted chi-square, 298, 306 

Continuous probability 
distributions, 186-187 
Continuous random variable, 189, 
190 

Continuous variables, 130, 131 
Control charts 
attributes, 493 
C-, 504-506, 562 
defined, 490 
false-alarm rate, 494 
generic QC, 566 

hypothesis testing and, 492-493 
individual, 509-512, 563 
P-, 506-509, 563 
Pareto, 513-516, 567 
range, 502-504, 564 


S-, 566 

subgroups, 493 
upper and lower control 
limits, 490 
variable, 493 
X, 493-502, 565 
XBAR, 565 

Controlled variation, 489, 490 
Correlation. See also Autocorrelation function 
causality and, 336-337 
command, 526 
defined, 335 

functions in Excel, 337-338 
matrix, command, 574 
matrix, creating, 338-343, 
374-375 

multiple, 359-360 
Pearson Correlation coefficient, 
336 

p values with Bonferroni, 342-343 
scatter plot matrix, creating, 343-345 
slope and, 336 
Spearman’s rank correlation coefficient, 337 
two-valued variable and, 342 
Covariance command, 527 
Cramer’s V, 298 
Critical values, 234 
Cut, cells, 17 

Cyclical autocorrelation, 443-444 
Cylinder chart, 84 

D 

Data 

Autofill used to enter, 37-39 
balanced, 419 
creating, 552-553 
discrimination, 377-378 
entering, 36-41 
formats, 41-45 

formulas and functions and, 45-50 
importing from databases, 68-75 
importing from text files, 63-68 
inserting new, 40-41 
paired, 244 
querying, 55-63 
Sample command, 556-557 
series, 84 

sorting, 54-55, 71-75 
Standardize command, 556 
tab, 9 

two-sample, 259-264 
Data Analysis ToolPak. See Analysis ToolPak/Data Analysis 
ToolPak 
Database query, 68 
Database Query Wizard, 68-71 
Databases, importing data from, 
68-75 

Data format buttons, 42 
Data points 
identifying, 105-106 
labeling, 107-109 
selecting data row, 106-107 
Data tab, commands, 74 
Degrees of freedom, 161, 240 
ANOVA and, 401, 422-424 
F distribution and, 354 
Delimited text, 63 
Delimiter, 63 
Deming, W. Edwards, 488 
Dependent variable, 314 
Descriptive statistics 
command, 527-529 
defined, 129 

functions, list of, 544-551, 
567-569 
Develop tab, 9 
Diagnostics, 329 

Discrete probability distributions, 
185-186 

Discrete random variables, 189 
Discrete variables, 129, 130, 131 
Distribution(s). See also Probability 
distributions; Sampling 
distributions 
defined, 132 

frequency tables and, 131-138 
functions, list of, 546-548, 582 

normal, 193-196, 331-332 
shapes, 141-143 
stem and leaf plots, 146-150 
Distribution statistics 
boxplots, 165-174 
means, medians, and mode, 
154-158 

outliers, 164-165 
percentiles and quartiles, 151-154 
skewness and kurtosis, measures 
of shape, 162-164 
variability, measures of, 159-161 

Doughnut chart, 84 
Durbin-Watson test statistic, 334 


E 

Editing charts, 91-105 
Effects model 
defined, 406 
fitting, 408-409 


Embedded chart objects, 84 
Embedding, 11 
chart objects, 84, 91 
Equality of variance. See variance, 
equality of 

Error sum of squares (SSE), 400, 401 

Estimators, 205-206 
Excel, 4 
add-ins, 24 
charts, 82-85 

commands and toolbars, 7-9 

elements, 6-7 

exiting, 34 

launching, 5-6 

printing from, 18-22 

ribbon, 7 

saving, 22-24 

Solver, 479-482 

starting, 5-6 

viewing, 6-7 

window, 6-7 

Workbooks and Worksheets, 10-17 
worksheet functions, 196-205 
Exiting, 34 

Explore workbooks, 2 
Exponential smoothing 
calculating smoothed values, 458 
choosing parameter values, 
455-457 

commands, 457, 529-530, 576 
forecasting with, 450-451, 
455-457, 474-478 
Holt’s method, 458 
location parameter, 457 
one-parameter, 448-457 
optimizing, 479-482 
recursive equations, 473-474 
seasonality, 462-473 
smoothing factor/constant, 449, 
479-482 

three-parameter, 473-478 
trend parameter, 458 
two-parameter, 457-462 
Winters’ method, 473-478 
Extreme outlier, 165 


F 

Faculty, underpaid example, 
380-385 

False-alarm rate, 494 

Fast bubble plot, 560 

Fast Scatter plot command, 558 

F distribution, 353-355 

Fences, 167-168 

Fields, 68 

Files, installing, 2-3 
Fill handle, 38 


Filtering/filters 
adding, 58 

Advanced Filter, 56, 59-61 
AutoFilter, 56-59 
removing, 58 

Fitted regression line, 314, 315-316 
Fixed-width file, 63 
Fonts, changing, 41-45 
Formats, 41-45 
Formatting labels, 109 
Formula bar, 7 
Formulas 
inserting, 46-47 
linked, 32 

mathematical, list of, 548-549, 583-584 
statistical analysis, 550-551, 584-586 
tab, 9 
trigonometric, 551 
F ratio, 327 
Frequency tables 
bins in, 134-138 
command, 567-568 
creating, 132-134 
defined, 132 

validity of chi-square test with, 
299-302 
F test, 258 
command, 530-531 
Function(s) 
arguments, 47 

descriptive statistics, 544-546, 
567-569 

distributions, 546-548, 582 
inserting, 47-50 
Library, 48 

mathematical, 548-549, 583-584 
name, 47 
statistical analysis, 550-551, 584-586 

trigonometric, 551 
worksheet, 196-197 

G 

Generic QC chart, 566 
Geometric mean, 156, 157 
Goodman-Kruskal Gamma, 299, 306 

Gossett, William, 240 
Gridlines, 100-102 

H 

Harmonic mean, 156, 157 
Heavy-tailed distribution, 143 
Hidden Data sheet, 33 







Histograms 

breaking into categories, 143-146 
commands, 531-532, 558-559, 561 
comparing, 144 

creating, 138-141, 558-559, 561 
defined, 138 
of difference data, 248 
distribution shapes, 141-143 
multiple, 561 

of random sample, 199-200 
verifying ANOVA assumptions 
using, 395-397 
Holt’s method, 458 
Home tab, 8 

Homogeneity of variance, 258 
Horizontal scroll bar, 7 
Hypothesis testing, 236-239 
acceptance and rejection regions, 
234-235 

additional thoughts, 239-240 
control charts and, 492-493 
defined, 232 
elements of, 233 
example of, 234 
one-tailed, 235 
p values, 235-236 
two-tailed, 235 
types of error, 233 

I 

Importing data 
from databases, 68-75 
from text files, 63-68 
In control process, 490 
Independence of residuals, testing 
for, 332-335 
Independent, 291 
Independent variables, 314 
Indicator Columns, Create, 
command, 554 
Indicator variables, 406-407 
Individual charts, 509-512 
command, 563 
Inferential statistics, 129 
Input sheet, 33 
Inserting new data, 40-41 
Insert tab, 8 
Installing files, 2-3 
Interaction plot, 417-419 
Intercept, 314 

Interquartile range, 151, 152, 165, 167 

K 

Kendall’s tau-b, 299, 306 
Keyboard shortcuts, 7 
Kurtosis, 162 


L 

Labeling data points, 107-109 
command, 580 
Lagged values 
calculating, 438 
defined, 438 
scatter plot of, 438-440 
Landmark summaries, 151 
Law of large numbers, 184 
LCLs. See Lower control limits 
Least squares estimates, 316 
Least squares method, 316 
Legends, 100-102 
Levene’s test, 258 

Likelihood ratio chi-square, 298, 306 
Linear model, 315 
Linear regression. See Regression 
Line chart, 84 

Line plots, seasonality and use of, 
468-470 

Location parameter, 457 
Lower control limits (LCLs), 490 

M 

MAD. See Mean absolute deviation 
Major unit, 100 
Mann-Whitney test, 265-266 
commands, 573-574 
MAPE. See Mean absolute percent 
error 

Mathematical formulas, list of, 
548-549, 583-584 
Mathematical operators, 45-46 
Mean, 154-158 
comparing, 402-404 
comparing with boxplot, 405-406 
Mean absolute deviation (MAD), 451 
Mean absolute percent error 
(MAPE), 451 
Mean Square (MS), 401 
Mean square error (MSE), 450, 
479-482 

Means matrix command, 574-575 
Means model, 393, 406, 410 
comparing, 402-404 
Measure of association, 297 
Median, 154-158 
Microsoft Query, 73 
Mixed reference, 51 
Mode, 156-157 
Moderate outlier, 165 
Modules, 30-31, 579 
Moving averages, 445-448 
command, 532-533 
Moving range charts 
command, 565 
create, 512 
defined, 510


MS. See Mean Square 
MSE. See Mean square error 
Multiple correlation, 359-360 
Multiple regression. See 
Regression, multiple 
Multiplicative seasonality, 462-463 
Multivariate analyses, commands, 
574-575 

N 

Name box, 7, 51, 52 
Names, range, 51 
Navigation buttons, 11-12 
Nominal variables, 130, 131 
Noncontiguous range, 15 
Nonparametric test 
Mann-Whitney test, 265-266 
to paired data, 250 
Sign test, 253-255 
to two-sample data, 265-266 
Wilcoxon Signed Rank test, 
250-253 

Normal distribution, 193-196 
defined, 193 
of residuals, 331-332 
Normal probability density 
function, 193-196 
difference data and, 249-250 
functions with, 196-197 
Normal probability plot, 201-205 
command, 560 
defined, 201
normal errors and, 370 
residuals and, 331-332, 378-379 
Normal score, 201 
Null hypothesis (H0), 233

O

Observation, 190 
Observed vs. predicted values, 
363-366 
Office button, 7 
One-parameter exponential 
smoothing, 448-457 
One-sample tests, commands, 
569-572 

One-tailed test, 235 
One-way ANOVA, 393 
regression and, 406-409 
Or condition, creating, 61 
Ordinal variables, 130, 131 
custom sort order, 307-309 
tables with, 302-309 
testing for a relationship between 
two, 303-307 

Oscillating autocorrelation, 443, 
444 

Outliers, 164-165, 168 
Out of control process, 491
Output options, 522-523 
Output sheet, 33 
Overparametrized model, 406 

P 

Page Layout tab, 8 
Paired data 
defined, 244 

non-parametric test applied, 
250-255 

t test applied, 244-250 
Parameters, 185, 205-206 
estimates of regression, 327-328 
location, 427 
trend, 458 
Parametric test, 250 
Pareto charts, 513-516 
command, 567 
Paste, cells, 17 

Patterned Data command, 553 
P charts, 506-509 
command, 563-564 
Pearson, Karl, 293 
Pearson chi-square statistic 
breaking down, 297 
defined, 293 

validity with small frequencies, 
299-302 

working with distribution, 
293-295 

Pearson Correlation coefficient, 336 
Percentiles, 151-154 
Period, 446 

Periodic Sample command, 557 
Phi, 298 
Pie charts, 84 
defined, 285 

displaying categorical data in, 
285-287 
Pivot tables 

changing displayed values, 
282-283 

creating, 279-280 
defined, 277 

displaying categorical data in bar 
charts, 283-285 

displaying categorical data in pie 
charts, 285-287 
inserting, 278 
removing categories from, 
280-282 

Plot symbols, 102-105 
Plotting residuals 
predicted values vs., 366-368 
predictor variables vs., 368-369 
Points, 87 


Poisson distribution, 185 
Pooled two-sample t statistic,
255, 256
Predicted values 
observed vs., 363-366 
plotting residuals vs., 366-368 
Prediction equation, 361-362 
Prediction, multiple regression and, 
355-356 

Predictor variables, 314 
plotting residuals vs., 368-369 
Printing 
page, 21-22 
previewing, 18-19 
setting up page for, 19-21 
Probability, defined, 183-184 
Probability density functions
(PDFs), 186, 187-189,
215-216

Probability distributions
Central Limit Theorem, 212-217

continuous, 186-187 
defined, 184 
discrete, 185-186 
normal, 193-196 
parameters and estimators, 
205-206 

random variables and samples, 
189-193 
Process, 488 
pth percentile, 151 
p values, 235-236 
F distribution and, 353-355 
with Bonferroni, 342-343 
Pyramid chart, 84 

Q 

Qualitative or categorical variables, 
130-131 

Quality control, statistical,
488-490

Quality control charts, 490-492, 
509-512 

C-, 504-506, 562-563 
commands, 562-567 
generic QC, 566 
P-, 506-509, 563-564 
Pareto, 513-516, 567 
range, 502-504, 564-565 
S-, 572 

statistical, 490 
X, 493-502, 565 
XBAR, 565 

Quantitative variables, 129,
130, 131

Quartiles, 151-154 


Querying data, 55-63 
database, 68 

R 

R²-value, 322. See also Coefficient
of determination
Radar chart, 84 

Random autocorrelation, 443, 444 
Random normal data 
charting, 199-200 
generating, 197-199 
Random Number Generation 
command, 533-534 
Random Numbers command, 553 
Random phenomenon, 183 
Random sampling, 190-193 
command, 557 

Random variables and samples, 
189-193 

charting, 199-200 
random variable defined, 189 
using Excel to generate, 197-199

Random walk model, 444 
Range, measure of variability, 159 
Range charts, 502-504 
command, 564-565 
Range names, 15, 51-53 
Rank and percentile command, 535

Record, 68 

Recursive equations, 473-474
References, 14, 16, 17, 50 
range, 33 
Regression 

analysis, performing, 318-328 
ANOVA and, 325, 326-327 
ANOVA one-way and, 406-409 
command, 535-537 
equation, 314-315 
exploring, 317-318 
fitted regression line, 314, 
315-316 

functions in Excel, 316-317 
interpreting analysis of variance 
table, 326-327 
model, checking, 329-335 
parameter estimates and 
statistics, 327-328 
plotting data, 320-323 
residuals, predicted values 
and, 328 

residuals, testing, 331-335 
simple linear, 314-317 
statistics, calculating, 323-325 
statistics, interpreting, 325-326 
straight-line assumption, testing, 
329-331 
Regression, multiple, 376
coefficients and prediction 
equation, 361-362 
example using, 371-385 
F distribution, 353-355 
multiple correlation, 359-360 
output, interpreting, 358-359, 377 
parameters, 353-356 
prediction using, 355-356 
t tests for coefficients, 362-363 
Regression assumptions, testing 
normal errors and plot, 370 
observed vs. predicted values, 
363-366 

plotting residuals vs. predicted 
values, 366-368 
plotting residuals vs. predictor 
variables, 368-369 
Rejection region, 233, 234-235 
Related, 291 

Relative frequency, 183-184 
Relative reference, 50 
Replicates, 410 
Residuals 

analysis of discrimination data, 
377-378 
defined, 314 
normal plot of, 378-379 
predicted values and, 328 
predicted values vs. plotting, 
366-368 

predictor variables vs. plotting, 
368-369 

testing for constant variance in, 332 
testing for independence of, 
332-335 

testing for normal distribution of, 
331-332 
Review tab, 9 
Ribbons 

context-sensitive, 9 
Ribbon tab, 7 
types, 8-9 
Robustness 
defined, 243 
t, 243-244 
Row headings, 7 
Runs test, 333 
command, 576-577 

S

Sample, 190 

test commands, 569-574 
Sampling command, 537-538 
Sampling data commands, 556-557 
Sampling distributions 
creating, 206-212 


defined, 206 

standard deviation/error, 212 
Saving work, 22-24 
Scatter chart, 84 
Scatter plots 

adding moving average to, 
446-447 

breaking into categories, 117-120 
commands, 558, 562 
components of, 86-91 
defined, 87 

lagged values and, 438-440 
matrix (SPLOM), creating, 
343-345, 373-374, 562 
regression data plotting and use 
of, 320-323 

variables, plotting, 120-123 
S charts, 566 
Scroll bars, 7 
horizontal, 7 
vertical, 7, 13 
Seasonality 
additive, 464 
adjusting for, 471-473
autocorrelation function and, 
470-471 

boxplots and, 467-468 
command, 577 
example of, 464-473 
line plots and, 468-470 
multiplicative, 462-463 
Shapes, measures of, 162-164 
Sheet tabs, 7 
Shewhart, Walter A., 488 
Sign test, 253-255 
command, 571 
Significance level, 233 
Single-Factor command, 523 
Skewness, 141, 162 
negative, 141 
positive, 141 
Slope 

correlation and, 336 
defined, 314

Smoothing factor/constant, 449 
Solver, 479-482 
Somers’ D, 299, 306 
Sorting data, 54-55, 71-75 
custom, 307-309 
Sparse cell, 299 

SPC. See Statistical process control 
Spearman’s rank correlation 
coefficient, 337 
Special causes, 489 
SPLOM. See Scatter plots, matrix 
Spreadsheets, 4 

SQC. See Statistical quality control 
SSE. See Error sum of squares 


SST. See Sum of square for 
treatment 

Standard deviation/error, 161,
212, 451

control limits and, 494-495, 
498-500 

Standardize (data) command, 556 
Standardized residual, 297 
Starting 
Excel, 5-6 

Statistical analysis functions, list 
of, 550-551, 584-586 
Statistical inference 
applying t test to two-sample 
data, 259-264 

confidence intervals, 225-232 
equality of variance, 258-259 
hypothesis testing, 232-235 
nonparametric test to paired data, 
250-255 

nonparametric test to two-sample 
data, 265-267 
t distribution, 240-250 
two-sample t test, 255-257
Statistical process control (SPC), 
488-490 

Statistical quality control (SQC), 
488-490 
StatPlus, 2 

About—command, 33, 579 
ANOVA and, 395, 397 
autocorrelation function and, 443 
boxplots and, 172-173 
checking availability of, 552 
commands, 552-580 
data points, identifying, 106 
distribution statistics and, 
162-163 

exponential smoothing and, 474 
frequency tables and, 134, 136-137 
hidden data, 31 
histograms and, 138, 143 
installing files, 2-3 
linked formulas, 32 
loading, 24-28 

Mann-Whitney test and, 265-266 
mathematical and statistical 
functions, 581-586 
modules, 30-31 
normal probability plot and, 
201-205 

Options command, 577-578 
Pareto charts and, 513-516 
percentiles and quartiles and, 
151-154 

random normal data and, 197-199

runs test and, 333 
scatter plots and, 118
seasonality and, 471 
setup options, 32-33 
Sign test and, 253-255 
table statistics, 297-299 
t test and, 245-249 
Wilcoxon Signed Rank test and, 
250-253 
Status bar, 7 

Stem and leaf plots, 146-150 
command, 559
Stock chart, 84 

Straight-line assumption, testing, 
329-331 

Stuart’s tau-c, 299, 306 
Subgroups, 493 

Sum of squared errors. See Error 
sum of squares 

Sum of square for treatment (SST), 
400-401, 422
Surface chart, 84 
Symbols, bubble, 113 
Symmetric distributions, 141 

T 

Tab group, 7 
Tables, 68 
commands, 554 
computing expected counts, 
291-293 

frequency, 131-138, 567-568 
ordinal variables and, 302-309 
other statistics used with, 
297-299 

Pearson chi-square statistic, 
293-297, 299-302 
pivot, 276-286 
statistics command, 567-569 
two-way, 288-291 
with ordinal variables, 303-307 
distribution, 296 
Tails, distribution, 143 
heavy, 143 
Tampering, 489 
Task bar, 5 

t confidence interval, 243 
t distribution 
defined, 240 
for coefficients, 362-363 
commands, 539-542 
construction t confidence 
interval, 243 

difference between standard 
normal and, 240, 241 
robustness, 243-244
working with, 242-243 
Test statistic, 233 


Text files, importing data from, 63-68
Text Import Wizard, 63-68 
Theoretical probability, 183 
Three-parameter exponential 
smoothing, 473-478
Time series 

analysis commands, 575-577 
analyzing change, 436-438 
autocorrelation function, 440-445
defined, 432
example, 432-440
exponential smoothing, one-parameter, 448-457
exponential smoothing, two-parameter, 457-462
exponential smoothing, three-parameter, 473-478
lagged values, 438-440
moving averages, 445-448
plotting percent change, 437-438

seasonality, 462-473 
Title, axis, 94-97 
Title bar, 6, 7 
Toolbars, 7 

Total sum of squares, 400, 422 
Treatment sum of squares, 400-401 
Trend autocorrelation, 443 
Trend parameter, 458 
Trigonometric formulas, list of, 551 
Trimmed mean, 156, 157 
t statistic, working with, 242-243 
t test 

applied to paired data, 244-250 
applied to two-sample data, 
259-264 

commands, 539-542, 569-570, 572 
for coefficients, 362-363 
Two-Factor with Replication 
command, 524 

Two-Factor without Replication 
command, 525 
Two-parameter exponential 
smoothing, 457-462 
Two-sample tests, commands, 
539-542, 572-574 
Two-sample t test 
applying, to two-sample data, 
259-264 

commands, 539-542, 572 
defined, 255 

pooled vs. unpooled, 256 
working with, 256-257 
Two-tailed test, 235 
Two-way ANOVA, 410-413
Two-way tables, 288-291 
Create command, 554 


Type I error, 233 
Type II error, 233 

U

UCLs. See Upper control limits 
Uncontrolled variation, 489-490 
Uniform distribution, 215-216 
Univariate statistics, 129, 162,
163, 164
command, 569 
Unloading 
add-ins, 30 

modules command, 579 
Unpooled two-sample t statistic, 
256 

Unstack Column command, 555 
Upper control limits (UCLs), 490 
Utilities, 578-579 

V 

Values 

using calculated, 62-63 
observed vs. predicted, 363-366

plotting residuals, vs. predicted, 
366-368 
Variability, 159 
measures of, 159-161 
Variable charts, 493 
Variables 

continuous, 130, 131 
correlation matrix, 374-375 
defined, 129 
dependent, 314 

descriptive statistics functions, 
544-546, 581-582 
discrete, 129, 130, 131 
independent, 314 
indicator, 406-407 
nominal, 130, 131 
ordinal, 130, 131, 302-309 
plotting, 120-123 
predictor, 314, 368-369 
qualitative or categorical, 130, 131

quantitative, 129, 130, 131 
random variables and samples, 
189-193 

regression equation and, 314-315 
tables with ordinal, 302-309 
Variance, 161. See also analysis of 
variance 

equality of, 258-259 
homogeneity of, 258 
one-way analysis of, 393 
Variation, controlled and 
uncontrolled, 489-490 
Vertical scroll bar, 7
View tab, 9 

W

Whiskers, 169
Wilcoxon Signed Rank test, 
250-253 

command, 571-572 
Windows 
starting, 2-3 
versions of, 2 
Winters’ method, 473 
Within-groups sum of squares, 400 
Workbooks 
opening, 10-11 
scrolling through, 11-14 
Worksheets, 7, 10 


cells, 6, 14-17 
hidden, 31 

X 

X axis, 87 

adding titles, 95-96 
change scale of, 97-99 
X charts (x-bar charts), 493 
calculating, when standard
deviation is known, 494-495
calculating, when standard
deviation is not
known, 498-500
command, 565 

examples of, 495-498, 500-502 
false-alarm rate, 494 
distribution, 296 


XBAR charts, 565 
XY(Scatter) chart, 84 

Y 

Y axis, 87

adding titles, 95-96 
change scale of, 99-100 

Z

Zoom controls, 7 
z test commands, 542-543, 570,
573

z test statistic, 225-228
defined, 226 
z values, 225-228 
defined, 226 
Quick Reference Guide

for Berk & Carey’s Data Analysis with Microsoft® Excel: Updated for Office 2007®. 


Objective Steps Refer to 


Add-Ins, installing 

Click the Office button, click the Excel Options button, click 
Add-Ins from the list of Excel options, click the Go button, 
and click the Browse button from within the Add-Ins dialog 
box to locate and load the add-in file. 

Chapter 1 

Autocorrelation, plot 

Click StatPlus > Time Series > ACF Plot. Select the 
data column and range of lag values. Requires StatPlus.

Chapter 11

Bivariate Normal data, 
create 

Click StatPlus > Create Data > Bivariate Normal. 

Specify the parameters of the bivariate distribution. Requires StatPlus. 


Boxplot, create 

Click StatPlus > Single Variable Charts > Boxplots. 

Select the boxplot options. Requires StatPlus. 

Chapter 4 

Bubble plots, create 

Click the Other Charts button from the Charts group on the 

Insert tab and then click a bubble chart subtype. 

Chapter 3 

C control chart, create 

Click StatPlus > QC Charts > C-Chart. Select the control 
data and specify the control chart options. Requires StatPlus. 

Chapter 12 

Chart axis, reformat 

With a chart selected, click the Axes button from the Axes group on 
the Layout tab of the ChartTools ribbon and then select whether to 
reformat the horizontal or vertical axis. 


Chart background, format 

With the chart selected, click the Plot Area button from the 
Background group on the Layout tab of the ChartTools ribbon and select 
the background option. 

Chapter 3 

Chart point labels, create 

With the chart selected, click StatPlus > Label series 
points. Requires StatPlus. 

Chapter 3 

Chart points, format 

With the chart selected, click any data point in the chart and click
Format Selection from the Current Selection group on the Format
tab of the ChartTools ribbon.

Chapter 3

Columns, stack 

Click StatPlus > Manipulate Columns > Stack. 

Select the columns to stack. Requires StatPlus. 


Columns, unstack 

Click StatPlus > Manipulate Columns > Unstack. 

Select the columns to unstack. Requires StatPlus. 

Chapter 10 

Correlation matrix, create 

Click StatPlus > Multivariate Analysis > Correlation 
Matrix. Enter the variables in the correlation matrix. Requires StatPlus. 

Chapter 8 

Data, enter from keyboard 

Click the cell and type the data values. 

Chapter 2 

Data, import from a 
database 

Click the Get External Data button on the Data tab; then select 
the From Other Sources button and select the data source. 

Chapter 2 

Data, import from text files 

Click the Office button, click the Excel Options button, click Add-Ins
from the list of Excel options, click the Go button, and click the Browse
button from within the Add-Ins dialog box to locate and load the add-in file.

Chapter 2

Data, query with 
advanced filter 

Enter the query conditions in the worksheet, and click the 
Advanced button from the Sort & Filter group on the Data tab. 

Chapter 2 

Data, query with AutoFilter 

Select the cell range and click the Filter button from the Sort & 

Filter group on the Data tab. 

Chapter 2 

Data, sort 

Select the cell range and then click the Sort button from the Sort & 
Filter group on the Data tab. 

Chapter 2 

Frequency table, create 

Click StatPlus > Descriptive Statistics > Frequency 
Tables. Select the frequency table options. Requires StatPlus. 

Chapter 4 


Histogram, create 

Click the Data Analysis button from the Analysis group on the 
Data tab and then click Histogram. 


Histogram, create 

Click StatPlus > Single Variable Charts > 
Histograms. Select the histogram options. Requires StatPlus. 

Chapter 4 

Histograms, create multiple 

Click StatPlus > Multi-variable Charts > Multiple 
Histograms. Select the variables and enter the histogram 
options. Requires StatPlus. 

Chapter 10 

Indicator variables, create 

Click StatPlus > Manipulate Columns > Create 
Indicator Columns. Select the data column. Requires StatPlus. 

Chapter 10 

Individuals control chart, 
create 

Click StatPlus > QC Charts > Individuals Chart. 

Select the control data and specify the control chart options. 

Requires StatPlus. 

Chapter 12 

Mann-Whitney Rank test, 
perform 

Click StatPlus > Two Sample Tests > 2 Sample 
Mann-Whitney Rank test. Enter the null and alternative 
hypotheses. Requires StatPlus. 

Chapter 6 

Means matrix, create 

Click StatPlus > Multivariate Analysis > Means Matrix.
Select the columns to display in the means matrix. Requires StatPlus.

Chapter 10

Moving average, 
add to scatterplot 

Right-click the chart series and click Add Trendline. 

Select Moving Average from the Trendline Options tab. 

Chapter 11

Moving Range control 
chart, create 

Click StatPlus > QC Charts > Moving Range Chart. 

Select the control data and specify the control chart options. 

Requires StatPlus. 

Chapter 12 

Multiple regression analysis, 
perform 

Click the Data Analysis button from the Analysis group on the Data
tab and click Regression. Requires the Analysis ToolPak.

Chapter 9

Normal probability plot, 
create 

Click StatPlus > Single Variable Charts > 

Normal P-plots. Requires StatPlus. 

Chapter 5 

One-sample Sign test, 
perform 

Click StatPlus > One Sample Tests > 1 Sample 

Sign test. Enter the null and alternative hypotheses. 

Requires StatPlus. 


One-sample t test, perform 

Click StatPlus > One Sample Tests > 1 Sample 
t-test. Enter the null and alternative hypotheses. 

Requires StatPlus. 

Chapter 6 

One-sample z test, perform 

Click StatPlus > One Sample Tests > 1 Sample 
z-test. Enter the null and alternative hypotheses. Requires StatPlus. 


One-parameter exponential 
smoothing, perform 

Click StatPlus > Time Series > Exponential 
Smoothing. Select the options for the one-parameter model. 
Requires StatPlus. 

Chapter 11

One-way analysis of variance, 
perform 

Click the Data Analysis button from the Analysis group
on the Data tab and click ANOVA: Single Factor. 

Requires the Analysis ToolPak. 

Chapter 10 

P control chart, create 

Click StatPlus > QC Charts > P-Chart. Select the control 
data and specify the control chart options. Requires StatPlus. 

Chapter 12 

Paired t test, perform 

Click StatPlus > One Sample Tests > 1 Sample 
t-test. Enter the null and alternative hypotheses. Requires StatPlus. 

Chapter 6 

Pareto chart, create 

Click StatPlus > QC Charts > Pareto Chart. Select the 
control data and specify the options for the Pareto chart. 

Requires StatPlus. 

Chapter 12 


Patterned data, create 

Click StatPlus > Create Data > Patterned Data. 

Specify the data pattern. Requires StatPlus.

Chapter 7 

PivotTable, create 

Click the PivotTable button from the Tables group on the Insert tab. 

PivotTable, 

grouping categories in 

Select cells from the PivotTable and then click the Group

Selection button from the Group group on the Options tab of the 
PivotTable Tools ribbon. 

Chapter 7 

PivotTable, 

remove categories from 

Click and drag the row or column label off of 
the pivot table. 

Chapter 7 

Random numbers, create 

Click the Data Analysis button from the Analysis group on the 

Data tab and then click Random Number Generation. 


Random numbers, create 

Click StatPlus > Create Data > Random Numbers. 

Select the probability distribution, number of samples, and sample 
size. Requires StatPlus. 

Chapter 5 

Range control chart, 
create 

Click StatPlus > QC Charts > Range Chart. Select the 
control data and specify the control chart options. Requires StatPlus. 

Chapter 12 

Range names, 

create from column labels 

Select the cell range and then click the Create from 

Selection button from the Defined Names group on the Formulas 
tab; then click the Top Row check box and click OK. 

Chapter 2 

Regression analysis, 
perform 

Click the Data Analysis button from the Analysis group on 
the Data tab and then click Regression. Requires the 

Analysis ToolPak 

Chapter 8 

Regression line, 
add to scatterplot 

Right-click the chart series and click Add Trendline. Select the 
regression type from the Trendline Options dialog sheet. 

Chapter 8 

Runs test, perform 

Click StatPlus > Time Series > Runs test. Enter the 
options of the test. Requires StatPlus. 

Chapter 8 

Sample, 

create a conditional 

Click StatPlus > Sampling > Conditional Sample. 

Enter the sampling conditions. Requires StatPlus. 


Sample, create a periodic 

Click StatPlus > Sampling > Periodic Sample. Enter 
the sampling conditions. Requires StatPlus. 


Sample, create a random 

Click StatPlus > Sampling > Random Sample. Enter 
the sampling conditions. Requires StatPlus. 


Scatterplot matrix, create 

Click StatPlus > Multi-variable Charts > 

Scatterplot Matrix. Enter the variables in the scatterplot 
matrix. Requires StatPlus. 

Chapter 8 

Scatterplot, 
break into categories 

Select the chart and click StatPlus > Display by Category. 
Specify the categorical variable to use. Requires StatPlus. 

Chapter 3 

Scatterplot, create quickly 

Click StatPlus > Single Variable Charts > Fast 
Scatterplot. Enter the data columns for the x and y axes. 

Requires StatPlus. 

Chapter 11 

Seasonal adjustment, 
perform 

Click StatPlus > Time Series > Seasonal Adjustment. 

Select the data column and the period of the season. Requires StatPlus. 

Chapter 11 

Standardize data 

Click StatPlus > Manipulate Columns > Standardize. Enter the 
data columns and select the method of standardization. Requires StatPlus. 

StatPlus modules, unloading 

Click StatPlus > Unload Modules, select the module’s 
checkbox and click OK. Requires StatPlus.

Chapter 1 


StatPlus, hidden data viewing 

Click StatPlus > General Utilities > View Hidden 
Data. Requires StatPlus.

Chapter 1 

StatPlus, installing 

Access the online installation program from the website and follow the 
instructions on the Installation Wizard. 


StatPlus, set options 

Click StatPlus > StatPlus Options. Requires StatPlus.

Chapter 1 

StatPlus, update links 

Click StatPlus > General Utilities > Resolve 

StatPlus Links. 

Chapter 1 

Stem and Leaf plot, create 

Click StatPlus > Single Variable Charts > Stem 
and Leaf. Select the Stem and Leaf options. Requires StatPlus.

Chapter 4 

Table statistics, calculate 

Click StatPlus > Descriptive Statistics > Table 
Statistics. Select the cell range containing the cell counts and 
table labels but not the column and row totals. Requires StatPlus.

Chapter 7 

Three-parameter exponential 
smoothing, perform 

Click StatPlus > Time Series > Exponential 
Smoothing. Select the options for the three-parameter model. 
Requires StatPlus.

Chapter 11

Two-sample t test, perform 

Click StatPlus > Two Sample Tests > 2 Sample 
t-test. Enter the null and alternative hypotheses. 

Requires StatPlus.

Chapter 6 

Two-sample z test, perform 

Click StatPlus > Two Sample Tests > 2 Sample 
z-test. Enter the null and alternative hypotheses. 

Requires StatPlus.


Two-parameter exponential 
smoothing, perform 

Click StatPlus > Time Series > Exponential 
Smoothing. Select the options for the two-parameter model. 
Requires StatPlus.

Chapter 11 

Two-way analysis of variance 
with replication, perform 

Click the Data Analysis button from the Analysis Group on the 

Data tab and click ANOVA: Two-Factor with Replication. 

Requires the Analysis ToolPak.

Chapter 10 

Two-way analysis of variance 
without replication, perform 

Click the Data Analysis button from the Analysis group on the Data 
tab and click ANOVA: Two-Factor without Replication. 

Requires the Analysis ToolPak.

Two-way table, create 

Click StatPlus > Manipulate Columns > Create
Two-Way Table. Select the columns for the table. Requires StatPlus.

Chapter 10

Univariate statistics, display 

Click the Data Analysis button from the Analysis group on the Data tab 
and click Descriptive Statistics. Requires the Analysis ToolPak.

Univariate statistics, display 

Click StatPlus > Descriptive Statistics > Univariate 
Statistics. Select the statistics to display. Requires StatPlus.

Chapter 4 

Unpaired t test, perform 

Click StatPlus > Two Sample Tests > 2 Sample 
t-test. Enter the null and alternative hypotheses. Requires StatPlus.

Chapter 6 

Wilcoxon Signed Rank test, 
perform 

Click StatPlus > One Sample Tests > 1 Sample 
Wilcoxon Signed Rank test. Enter the null and alternative 
hypotheses. Requires StatPlus.

Chapter 6 

XBar control chart, create 

Click StatPlus > QC Charts > XBar Chart. Select the 
control data and specify the control chart options. Requires StatPlus.

Chapter 12 





















